Some months ago I got TCP Zero Copy working in my application, thanks to Hein and some contributors in the forum. Now I have tried the same with UDP Zero Copy to gain speed, and it seems that payloads randomly get corrupted. It happens very infrequently, around once every 2 million packets. My application is very sensitive to packet loss, and even one bad packet every couple of million creates a problem. Without zero copy I have no payload corruption at all.
Some background info on my application:
1- I have a dedicated (double shielded) Ethernet connection between the computer where my application receives the data and the STM32H7 acquisition board. I see no packet loss at all, just payload corruption
2- For UDP Zero Copy I pre-allocate some payload buffers in advance via FreeRTOS_GetUDPPayloadBuffer
3- I fill the data into the payload buffers in a SAI ISR and send the packets in a task. I use a task notification system and make sure that the SAI ISR never overwrites a payload which is being sent by the task
4- I use BufferAllocation_1 and heap 4
5- My FreeRTOS+TCP is up-to-date
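To make the ownership rule from point 3 concrete, here is a minimal host-side sketch of a per-buffer state machine. It is not the task notification code described above: it uses C11 atomics instead, and SLOT_COUNT, the state names and all function names are hypothetical. The idea is only that the ISR may touch a buffer in the FREE or FILLING state, and the task may touch one in the READY or SENDING state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOT_COUNT 4u /* assumed number of pre-allocated payload buffers */

typedef enum { SLOT_FREE = 0, SLOT_FILLING, SLOT_READY, SLOT_SENDING } slot_state_t;

/* One state per payload buffer; static storage starts at 0 (SLOT_FREE). */
static atomic_int slot_state[SLOT_COUNT];

/* Atomically move a slot from one state to another; fails if the slot
 * is not currently in the expected state. */
static bool slot_transition(size_t slot, slot_state_t from, slot_state_t to)
{
    int expected = (int)from;
    assert(slot < SLOT_COUNT);
    return atomic_compare_exchange_strong(&slot_state[slot], &expected, (int)to);
}

/* Producer side (the SAI ISR in the post). */
static bool isr_begin_fill(size_t slot)   { return slot_transition(slot, SLOT_FREE, SLOT_FILLING); }
static bool isr_publish(size_t slot)      { return slot_transition(slot, SLOT_FILLING, SLOT_READY); }

/* Consumer side (the sending task in the post). */
static bool task_begin_send(size_t slot)  { return slot_transition(slot, SLOT_READY, SLOT_SENDING); }
static bool task_finish_send(size_t slot) { return slot_transition(slot, SLOT_SENDING, SLOT_FREE); }
```

With this scheme, a failed isr_begin_fill() means the ISR was about to overwrite a buffer that is still queued or being sent, which is exactly the situation point 3 is meant to rule out.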
It seems that somebody had a similar issue in the past:
Any idea on how to resolve this? I really doubt that it is an issue in my application, as there are almost no differences in code between Zero Copy and Non-Zero Copy.
I would really like to solve the issue, as UDP Zero Copy is noticeably faster.
Hi, the corruption happens in the payload, not in the UDP header. The packet itself seems to be fine, as it is properly received by my application in Labview where I gather the packets and process them. I notice the corruption both in Labview and when debugging the STM32H7: I have a packet counter in the first 4 bytes of the payload, so I can stop exactly when I detect that the sequence is broken. The packet counter (as well as the rest of the payload) contains random data. I could capture a packet and show you the payload, but I do not think you would be able to extract any valuable info from it.
I have already verified that the issue does not happen during the sending process. To do so:
1- I’ve set a breakpoint before the packet is sent that triggers only if the packet counter in the payload is not plausible.
2- I send the payload.
3- I’ve set a breakpoint after sending the payload that checks the packet counter for corruption again, as in point 1.
The packet counter was corrupted both before and after the payload was sent. I will repeat the test this evening to confirm that the data is corrupted identically before and after sending the packet, but I think the problem happens somewhere else.
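A plausibility check like the one used for these conditional breakpoints could be sketched as follows. The function names are hypothetical, and it assumes the counter is stored in the MCU's native byte order and increases by exactly one per packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read the 32-bit packet counter stored in the first
 * four bytes of a payload. memcpy avoids unaligned-access problems. */
static uint32_t payload_counter(const uint8_t *payload)
{
    uint32_t counter;
    assert(payload != NULL);
    memcpy(&counter, payload, sizeof counter);
    return counter;
}

/* Returns true when the counter in the payload is exactly one more than
 * the previous one. Unsigned arithmetic wraps around at 2^32. */
static bool counter_is_plausible(uint32_t previous, const uint8_t *payload)
{
    return payload_counter(payload) == (uint32_t)(previous + 1u);
}
```

Calling counter_is_plausible() right before and right after the send, and breaking when it returns false, gives the two checkpoints from points 1 and 3.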
When you say before, do you mean that the corruption happens even before you call the UDP send? If yes, then the corruption is likely in the application.
A common technique to catch memory corruption is to place a variable next to the one getting corrupted and put a data breakpoint on it. Is it possible for you to do that? Also, can you share a code snippet to help us understand what memory is getting corrupted?
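As a rough illustration of that technique, here is a sketch with guard words placed directly before and after a buffer. The type, the names, the buffer size and the pattern are all made up for the example; they are not taken from FreeRTOS+TCP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SIZE  64u         /* assumed size, for illustration only */
#define GUARD_PATTERN 0xDEADBEEFu /* arbitrary marker value */

/* Keeping the guards in the same struct as the buffer keeps them adjacent
 * in memory (PAYLOAD_SIZE is a multiple of 4, so no padding is added). */
typedef struct {
    uint32_t guard_before;
    uint8_t  payload[PAYLOAD_SIZE];
    uint32_t guard_after;
} guarded_buffer_t;

static void guarded_buffer_init(guarded_buffer_t *gb)
{
    assert(gb != NULL);
    gb->guard_before = GUARD_PATTERN;
    memset(gb->payload, 0, sizeof gb->payload);
    gb->guard_after = GUARD_PATTERN;
}

/* True while neither guard word has been overwritten. */
static bool guarded_buffer_intact(const guarded_buffer_t *gb)
{
    assert(gb != NULL);
    return gb->guard_before == GUARD_PATTERN && gb->guard_after == GUARD_PATTERN;
}
```

On the target, a data breakpoint (watchpoint) on guard_before or guard_after halts the CPU at the instruction that writes outside the buffer, because nothing in the application should ever write to those words.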
I will do the test this afternoon, but I doubt it comes from the application, as it only touches the Ethernet buffer array from FreeRTOS+TCP in an ISR for zero copy transfers. When doing non zero copy I use my own application buffer array, and this one never gets corrupted even though it is written in exactly the same way from the ISR.
Good point Hartmut, I will check how I set up the memory protection areas. I thought that when doing non zero copy my local buffer would eventually be copied into the Ethernet buffer structure anyway, so I should see cache problems there as well if the issue is coherency and the area is cacheable. I will have a look today and give feedback.
You call FreeRTOS_sendto() with the FREERTOS_ZERO_COPY flag. And when you do so, you find that the contents of the payload buffer get corrupted.
That surprises me, because there is only a minor difference between using FREERTOS_ZERO_COPY or not.
When the flag is not set, a network buffer is created and your buffer will be copied to that network buffer.
Like Hartmut, I also thought about cacheable memory.
When using zero-copy, the SAI will write to a different region of RAM. Can you tell which memory banks are involved? The ETH peripheral has complete access to that memory, but does the SAI also have full access? Does that work reliably?
I remember that the STM32H7 has many types of memories, which confused me quite a bit.
Hi Hein, thanks for your input as well. Further to what I have described above:
1- I have done a rather radical test by disabling the Data Cache (I commented out SCB_EnableDCache() in main.c, and everything was very slow) and I still see corrupted payloads
2- The SAI DMA has its own very small buffers in a region that it can access. In each SAI ISR I manually copy those small buffers into parts of the payload packets. For this reason it does not really matter which memory region the Ethernet buffers are in (from the point of view of the SAI).
As I have no issue at all when using my own buffers without Zero Copy, I suspect the corruption could come from something that is happening in the Ethernet buffer (ucNetworkPackets in the ethernet_data section starting at 0x24040000), not when I send the payloads but somewhere else. As mentioned, everything typically works well, it is just a matter of 1 corrupted payload every 2 million packets sent (in my application that creates a big issue though).
I will modify the code a little and do parallel buffering in the SAI ISRs, both in the Ethernet buffer and in my own one, and compare them once I detect a payload corruption. I will let you know about the results.
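The comparison step could be as simple as the sketch below, which reports the first byte offset at which the two copies differ. The function name and the NO_MISMATCH sentinel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MISMATCH ((size_t)-1) /* returned when both buffers are identical */

/* Returns the offset of the first byte that differs between the two
 * buffers, or NO_MISMATCH if the first len bytes are equal. */
static size_t first_mismatch(const uint8_t *a, const uint8_t *b, size_t len)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            return i;
        }
    }
    return NO_MISMATCH;
}
```

Knowing the offset and length of the damaged region is often more useful than the corrupted values themselves, since it hints at what kind of structure was written over the payload.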
Thanks to all for your comments and support! I have experimented a little since my last post, but unfortunately did not manage to solve it (I am a little out of my depth with this topic).
1- Moved the ethernet_data section from the DTCRAM section to D2 RAM → Same issue. In all those areas the data cache was disabled
2- Increased the heap and stack sizes substantially (I think it worked, still need to verify) → Same issue
3- Created a parallel buffer which is being filled in with the same data in the SAI ISR just after the ethernet one → Very interesting here. Both data buffers get partially corrupted once every couple of million packets, but in different areas. I show you a capture below:
BaseBuffAddress is an array of pointers to the payload areas pre-allocated at startup in the ucNetworkPackets buffer via the FreeRTOS_GetUDPPayloadBuffer function. The first position of each payload is a packet counter. As you can see, the first area (array position 0) is corrupted.
As mentioned, I am not a hardcore Cortex-M SW developer, so this issue is really complex for me to debug. I am wondering whether I get some heap/stack corruption from time to time, and I am trying to check if and how memory watchpoints work in Stm32CubeMX.
Any further suggestions are greatly appreciated here!
That is interesting. Since we know that this buffer is getting corrupted, can we stop copying to this secondary buffer and put data breakpoints on some of the locations which we know get corrupted? This way we can probably catch the corruption right when it happens.
I have one NUCLEO-H743ZI2 board. If you are using the same and are willing to share a minimal project demonstrating the problem, I can attempt to give it a try.
Thanks a lot for the offer Gaurav. The thing is that I have developed a custom board with external ADCs that communicate through SAI, and getting the code to run on a NUCLEO board will be difficult due to the missing ICs.
I do have news for all of you, however. I think I might be getting closer to the root cause of the issue, at least for the Ethernet buffer (for the secondary one I really need to check what I am doing wrong).
I have done the following:
1- I have stopped changing the packet counter in the reserved Ethernet buffers that I got through the FreeRTOS_GetUDPPayloadBuffer calls. I do not write anything there from my application code
2- I’ve placed a watchpoint on the first address of the first buffer, where I previously wrote the packet counter
3- After waiting patiently, around 2 million packets later, the watchpoint is triggered! See below:
This is amazing - seems like we are really close. As you mentioned, the buffer you obtained using FreeRTOS_GetUDPPayloadBuffer is getting overwritten. Can you examine the values of pxBuffer and pxNewBuffer in the function pxDuplicateNetworkBufferWithDescriptor and see whether they are the same (meaning the buffer somehow got allocated twice) or whether some areas overlap (indicating some issue with the allocator)?
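If it helps, the "same or overlapping" question can be answered mechanically with a small helper like the one below. This is a generic sketch with a hypothetical function name; it assumes you feed it the start addresses and lengths of the two Ethernet buffers as read from the debugger (for example the pucEthernetBuffer pointers of the two descriptors).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when the byte ranges [a, a + a_len) and [b, b + b_len)
 * share at least one byte. Identical start addresses count as overlap. */
static bool ranges_overlap(const void *a, size_t a_len, const void *b, size_t b_len)
{
    assert(a != NULL && b != NULL);
    uintptr_t a_start = (uintptr_t)a;
    uintptr_t b_start = (uintptr_t)b;
    uintptr_t a_end = a_start + a_len; /* one past the last byte of a */
    uintptr_t b_end = b_start + b_len; /* one past the last byte of b */
    return a_len > 0u && b_len > 0u && a_start < b_end && b_start < a_end;
}
```

Two buffers that merely sit next to each other in the pool do not count as overlapping here; only ranges that actually share bytes do.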