STM32H743 FreeRTOS+TCP issue with ZERO Copy, this time UDP

arijav · January 21, 2023, 7:16pm

Hi,

Some months ago I got TCP Zero Copy working in my application, thanks to Hein and some contributors in the forum. Now I have tried the same with UDP Zero Copy to gain speed, and it seems that the payloads randomly get corrupted. It happens very infrequently, around once every 2 million packets. My application is very sensitive to packet loss, and even one bad packet every couple of million creates a problem. When using non-zero-copy I have no payload corruption issues at all.

Some background info to my application:

1- I have a dedicated ethernet (double shielded) connection between my computer where the application receives data and the STM32H7 acquisition board. I see no packet loss at all, just payload corruption

2- For UDP Zero Copy I preassign some buffers for the payloads in advance via FreeRTOS_GetUDPPayloadBuffer

3- I fill in the data into the payload buffers in a SAI ISR and send the packets in a task. I use a task notification system and make sure that the SAI ISR never overwrites a payload which is being sent by the task

4- I use BufferAllocation_1 and heap 4

5- My FreeRTOS+TCP is up-to-date

It seems that somebody had a similar issue in the past:

Any idea on how to resolve this? I really doubt that it is an issue in my application, as there are almost no differences in code between Zero Copy and Non-Zero Copy.

I would really like to solve the issue, as UDP Zero Copy is noticeably faster.

Thanks for your support and Best Regards,

Javier

archigup · January 23, 2023, 8:54pm

Hey, what kind of corruption are you seeing in the corrupted packets?

Would you be able to provide a sample corrupted packet?

arijav · January 23, 2023, 9:48pm

Hi, the corruption happens in the payload, not in the UDP header. The packet itself seems to be fine, as it is properly received by my Labview application, where I gather the packets and process them. I notice the corruption both in Labview and when debugging the STM32H7: I have a packet counter in the first 4 bytes of the payload, so I can stop exactly when I detect that the sequence is broken. The packet counter (as well as the rest of the payload) contains some random data. I could capture a packet and show you the payload, but I do not think you would be able to extract any valuable info from it.
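
A minimal sketch of the sequence check described here, assuming the counter is a 32-bit value stored in the CPU's native byte order in the first 4 bytes of the payload. The helper names `payload_counter` and `counter_is_next` are hypothetical and not part of FreeRTOS+TCP.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read the 32-bit packet counter from the first
 * 4 bytes of a payload. memcpy avoids unaligned-access problems.
 * Assumption: the counter is stored in native byte order. */
static uint32_t payload_counter(const uint8_t *payload)
{
    uint32_t counter;
    memcpy(&counter, payload, sizeof counter);
    return counter;
}

/* Returns true when the payload carries the counter that follows
 * `previous`. Unsigned arithmetic makes the wrap from 0xFFFFFFFF to 0
 * count as a valid step. */
static bool counter_is_next(uint32_t previous, const uint8_t *payload)
{
    return payload_counter(payload) == (uint32_t)(previous + 1u);
}
```

A breakpoint placed on the branch where `counter_is_next()` returns false stops the target on the first broken sequence.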

aggarg · January 24, 2023, 6:46am

Are you able to break in debugger when the payload gets corrupted? If not, can you try the following -

Before calling the UDP send, copy the payload to a buffer (say buf1).
Right before handing over the buffer to the hardware in xNetworkInterfaceOutput, compare the payload with buf1.
Put a breakpoint where the contents of the two buffers are found to differ. A hit there confirms that the buffer was corrupted somewhere in the stack.
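
The three steps above could be sketched roughly as follows. This is only an illustration: `snapshot_take`, `snapshot_first_mismatch` and `SNAPSHOT_MAX_BYTES` are hypothetical names, and the pointer passed to the compare function must be the payload address inside the frame (after the Ethernet, IP and UDP headers), not the start of the frame.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: payloads never exceed this size. */
#define SNAPSHOT_MAX_BYTES 1500u

static uint8_t snapshot_bytes[SNAPSHOT_MAX_BYTES]; /* plays the role of buf1 */
static size_t snapshot_length;

/* Call right before the UDP send: keep a private copy of the payload. */
static void snapshot_take(const void *payload, size_t length)
{
    snapshot_length = (length < SNAPSHOT_MAX_BYTES) ? length : SNAPSHOT_MAX_BYTES;
    memcpy(snapshot_bytes, payload, snapshot_length);
}

/* Call right before the buffer is handed to the hardware.
 * Returns the offset of the first differing byte, or -1 if the payload
 * still matches the snapshot. */
static long snapshot_first_mismatch(const void *payload)
{
    const uint8_t *bytes = payload;
    for (size_t i = 0; i < snapshot_length; i++) {
        if (bytes[i] != snapshot_bytes[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

Returning the offset rather than a plain yes/no also shows whether the damage starts at the beginning of the payload or somewhere in the middle.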

Thanks.

arijav · January 24, 2023, 7:45am

Hi,

I have already verified that the issue does not happen during the sending process. To do so:

1- I’ve set a breakpoint before the packet is sent that triggers only if the packet counter in the payload is not plausible.

2- I send the payload.

3- I’ve set a breakpoint after sending the payload that checks the packet counter for corruption again, as in point 1.

The packet counter was corrupted both before and after the payload was sent. I will repeat the test this evening to make sure that the data is corrupted identically before and after sending the packet, but I think the problem happens somewhere else.

BR,

Javier

aggarg · January 24, 2023, 8:30am

When you say before, do you mean that the corruption happens even before you call the UDP send? If yes, then the corruption is likely in the application.

A common technique to catch memory corruption is to place a variable next to the one getting corrupted and put a data breakpoint on it. Is it possible for you to do that? Also, can you share a code snippet to help us understand what memory is getting corrupted?
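
One way to sketch the guard-variable idea, assuming the buffer under suspicion is one the application defines itself. `guarded_buffer_t`, `GUARD_PATTERN` and `PAYLOAD_BYTES` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GUARD_PATTERN 0xDEADBEEFu
#define PAYLOAD_BYTES 64u /* hypothetical payload size */

/* Guard words sit directly before and after the payload, so a write that
 * runs past either end of the payload lands on a guard first. */
typedef struct {
    volatile uint32_t guard_before;
    uint8_t payload[PAYLOAD_BYTES];
    volatile uint32_t guard_after;
} guarded_buffer_t;

static void guarded_init(guarded_buffer_t *buffer)
{
    buffer->guard_before = GUARD_PATTERN;
    memset(buffer->payload, 0, sizeof buffer->payload);
    buffer->guard_after = GUARD_PATTERN;
}

/* Returns true while both guard words still hold the pattern. */
static bool guarded_intact(const guarded_buffer_t *buffer)
{
    return buffer->guard_before == GUARD_PATTERN &&
           buffer->guard_after == GUARD_PATTERN;
}
```

A data breakpoint (watchpoint) on `guard_before` or `guard_after` halts the core at the instruction that overwrites it, while `guarded_intact()` can be polled from a task when no watchpoint is available.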

arijav · January 24, 2023, 9:21am

I will do the test this afternoon, but I doubt it comes from the application: for zero-copy transfers it only touches the FreeRTOS+TCP Ethernet buffer array from an ISR. When doing non-zero-copy I use my own application buffer array, which never gets corrupted, and it is written in exactly the same way from the ISR.

hs2 · January 24, 2023, 9:29am

Do you use cacheable memory for the payload buffers? If so, you could have a cache coherency issue.

arijav · January 24, 2023, 10:12am

Good point Hartmut, I will check how I set up the memory protection regions. I thought that with non-zero-copy my local buffer would eventually be copied into the Ethernet buffer structure anyway, so I should see cache problems there as well if the issue is coherency and the area is cacheable. I will have a look today and give feedback.

htibosch · January 24, 2023, 4:30pm

Thanks for your report.

You call FreeRTOS_sendto() with the FREERTOS_ZERO_COPY flag. And when you do so, you find that the contents of the payload buffer get corrupted.

That surprises me, because there is only a minor difference between using FREERTOS_ZERO_COPY and not using it.

When the flag is not set, a network buffer is created and your buffer will be copied to that network buffer.

Like Hartmut, I also thought about cacheable memory.

When using zero-copy, the SAI will write to a different region of RAM. Can you tell which memory banks are involved? The ETH peripheral has complete access to that memory, but does the SAI also have full access? Does that work reliably?

I remember that the STM32H7 has many types of memories, which confused me quite a bit.

arijav · January 24, 2023, 10:24pm

Hi Hein, thanks for your input as well. Further to what I have described above:

1- I have done a rather radical test by disabling the data cache (I commented out SCB_EnableDCache() in main.c; everything was very, very slow) and I still see corrupted payloads.

2- The SAI DMA has its own very small buffers in a region that it can access. I manually copy those small buffers into parts of the payload packets in each SAI ISR. For this reason it does not really matter in which memory region the Ethernet buffers are (from the point of view of the SAI).

As I have no issue at all when using my own buffers w/o Zero Copy, I suspect the corruption could come from something that is happening in the ethernet buffer (ucNetworkPackets in ethernet_data section starting on 0x24040000), not when I send the payloads but somewhere else. As mentioned, everything works typically well, it is just a matter of 1 corrupted payload every 2 million packets sent (in my application that creates a big issue though).

I will modify the code a little and do parallel buffering in the SAI ISRs, both in the Ethernet buffer and in my own one, and compare them once I detect a payload corruption. I will let you know about the results.

BR,

Javier

glenenglish · January 27, 2023, 8:29pm

Given that corruption is rare…

Are these buffers in general SRAM or TCM?
1.5) Is this between identical systems, or, say, a Linux box sending packets and your STM32 system?
What happens if you reduce the number of buffers, that is, reduce the memory footprint? Does the corruption rate decrease?
Is your corruption rate measurement statistically valid (enough measurements)? That is, what does a histogram of the corruption look like: is it single mode? Spread out? Time related?
How many packets AND how much time can pass for NON zero copy?
What happens if the network PHY is on internal test loopback?
Is the corruption of the payload spread out or in one region/location?
Are network MAC CRCs known to be working by means of error injection using the PHY test?

arijav · January 28, 2023, 7:53am

Hi,

Thanks to all for your comments and support! I have experimented a bit since my last post, but unfortunately did not manage to solve it (I am a little bit out of my depth with this topic).

I have:

1- Moved the ethernet_data section from the DTCRAM section to D2 RAM → Same issue. In all those areas data cache was disabled
2- Increased the heap and stack sizes substantially (I think it worked, still need to verify) → Same issue
3- Created a parallel buffer which is being filled in with the same data in the SAI ISR just after the ethernet one → Very interesting here. Both data buffers get partially corrupted once every couple of million packets, but in different areas. I show you a capture below:

BaseBuffAddress is an array of pointers to the payload areas pre-allocated at startup in the ucNetworkPackets buffer via the FreeRTOS_GetUDPPayloadBuffer function. The first position of each payload is a packet counter. As you can see, the first area (array position 0) is corrupted.

Now here is the second buffer, not related to Ethernet, that gets the same data copied in the SAI ISR. As you can see here, positions 2, 3 and 4 of the array got corrupted:

As mentioned, I am not a hardcore Cortex-M SW developer, so this issue is really complex for me to debug. I am wondering whether I get some heap/stack corruption from time to time, and I am trying to check if and how memory watchpoints work in Stm32CubeMX.

Any further suggestions are greatly appreciated here!

aggarg · January 29, 2023, 4:01pm

That is interesting. Since we know that this buffer is getting corrupted, can we stop copying to this secondary buffer and put data breakpoints on some locations which we know get corrupted? This way we can probably catch the corruption right when it happens.

I have one NUCLEO-H743ZI2 board. If you are using the same and are willing to share a minimal project demonstrating the problem, I can attempt to give it a try.

arijav · February 1, 2023, 8:38am

Thanks a lot for the offer, Gaurav. The thing is that I have developed a custom board with external ADCs that communicate through SAI, and getting the code to run on a NUCLEO board will be difficult due to the missing ICs.

I do have news for all of you, however. I think I might be getting closer to the root cause of the issue, at least for the Ethernet buffer (for the secondary one I really need to check what I am doing wrong).

I have done the following:
1- I have stopped changing the packet counter in the reserved Ethernet buffers that I got through the FreeRTOS_GetUDPPayloadBuffer calls. I do not write anything there from my application code.
2- I’ve placed a watchpoint on the first address of the first buffer, where I previously wrote the packet counter.
3- After waiting patiently, around 2 million packets later, the watchpoint is triggered! See below:

It seems that the reserved packet area is overwritten by a memcpy process within the pxDuplicateNetworkBufferWithDescriptor call, which is only triggered when ZERO copy is enabled:

Any clue if I am doing something wrong, or is it a FreeRTOS+TCP bug?

To add more information to my original post. I have 3 connections with my Labview computer ongoing, 1 UDP for ADC data and 2 TCP ones for control purposes.

Thanks again for your support and Best Regards!

aggarg · February 1, 2023, 10:00am

This is amazing - seems like we are really close. As you mentioned, the buffer you obtained using FreeRTOS_GetUDPPayloadBuffer is getting overwritten. Can you examine the values of pxBuffer and pxNewBuffer in the function pxDuplicateNetworkBufferWithDescriptor and see whether they are the same (meaning the buffer somehow got allocated twice) or whether some areas overlap (indicating some issue with the allocator)?
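
The same-or-overlapping question can also be answered in code rather than by reading addresses in the debugger. A minimal sketch, where `ranges_overlap` is a hypothetical helper and the caller supplies the start address and length of each area:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when [a, a + a_len) and [b, b + b_len) share at least one
 * byte. Identical start addresses count as overlapping. The comparison is
 * done on differences so that no address sum can overflow. */
static bool ranges_overlap(const void *a, size_t a_len,
                           const void *b, size_t b_len)
{
    uintptr_t a_start = (uintptr_t)a;
    uintptr_t b_start = (uintptr_t)b;

    if (a_len == 0u || b_len == 0u) {
        return false; /* an empty range cannot overlap anything */
    }
    if (a_start <= b_start) {
        return (b_start - a_start) < a_len;
    }
    return (a_start - b_start) < b_len;
}
```

Called with a reserved payload area as one range and the destination of the copy plus its length as the other, a true result marks the collision.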

arijav · February 1, 2023, 10:11am

The area at pxNewBuffer->pucEthernetBuffer reaches into my reserved packet area, as uxLengthToCopy is 70. In red you can see the collision:

htibosch · February 1, 2023, 4:00pm

Can you explain in more detail what we are looking at?

aggarg · February 1, 2023, 4:55pm

@htibosch We had a call today and I think we have narrowed it down to the following -

The application calls FreeRTOS_GetUDPPayloadBuffer multiple times in the beginning and reserves those buffers for the lifetime of the application - these are used for sending application data.
When the memory corruption happened, we saw that one of these network buffers was being used in pxDuplicateNetworkBufferWithDescriptor, which indicates that it was probably released at some point.

@arijav is trying to add some debugging code to find out where that buffer is getting released.
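
One possible shape for such debugging code is sketched below. It is an assumption, not the code used in the call: the application records each reserved payload pointer, and a check placed in the buffer release path reports when the buffer being released contains one of them. `reserved_add`, `reserved_find_in` and `RESERVED_MAX` are hypothetical names; where the check is called from, and how the released buffer's start and length are obtained, is left to the integrator.

```c
#include <stddef.h>
#include <stdint.h>

#define RESERVED_MAX 8u /* hypothetical number of reserved payload buffers */

static const void *reserved_payloads[RESERVED_MAX];
static size_t reserved_count;

/* Record a payload pointer the application intends to keep.
 * Returns its index, or -1 when the table is full. */
static int reserved_add(const void *payload)
{
    if (reserved_count >= RESERVED_MAX) {
        return -1;
    }
    reserved_payloads[reserved_count] = payload;
    return (int)reserved_count++;
}

/* Returns the index of the reserved payload that lies inside
 * [start, start + length), or -1 if none does. Checking a range rather
 * than an exact pointer allows for the protocol headers that precede the
 * payload within a network buffer. */
static int reserved_find_in(const void *start, size_t length)
{
    uintptr_t low = (uintptr_t)start;

    for (size_t i = 0; i < reserved_count; i++) {
        uintptr_t p = (uintptr_t)reserved_payloads[i];
        if (p >= low && (p - low) < length) {
            return (int)i;
        }
    }
    return -1;
}
```

A breakpoint on the branch where `reserved_find_in()` returns a non-negative index would show the call stack that releases a buffer the application still considers its own.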

arijav · February 1, 2023, 7:13pm

Hello Hein and Gaurav, I will try to implement the checks to verify whether the buffers have been released in the next few days. I will let you know the results as soon as possible.

Thanks again for your valuable support