STM32H743 FreeRTOS+TCP issue with ZERO Copy

Hello,

I have successfully replaced LWIP with FreeRTOS+TCP in my STM32H743 project. I can now send large amounts of data (ADC values) over TCP quickly and reliably (with LWIP, communication stopped after approximately 10 hours).

In order to be able to send TCP packets I had to disable the ZERO Copy Driver though, as when enabled no packets were sent. As speed is of the essence in my application, I would like to ask you about your experiences with the ZERO copy driver for TCP TX and RX with the STM32H743. I would really like to understand the problem in my setup.

Some information on how I use FreeRTOS+TCP:

TCP transmission code:

    /* Keep sending until the entire buffer has been sent. */
    while( xAlreadyTransmitted < xTotalLengthToSend )
    {
        /* How many bytes are left to send? */
        xLenToSend = xTotalLengthToSend - xAlreadyTransmitted;
        xBytesSent = FreeRTOS_send( /* The socket being sent to. */
                                    xSocket,
                                    /* The data being sent. */
                                    &( buff[ xAlreadyTransmitted ] ),
                                    /* The remaining length of data to send. */
                                    xLenToSend,
                                    /* ulFlags. */
                                    0 );

        if( xBytesSent >= 0 )
        {
            /* Data was sent successfully. */
            xAlreadyTransmitted += xBytesSent;
        }
        else
        {
            /* Error - break out of the loop for graceful socket close. */
            break;
        }
    }

The Buffer (buff) is placed in the D2 RAM (SRAM3: 0x30040000)

The Ethernet descriptors (DMARxDscrTab) are placed as well in D2 RAM but in SRAM2: 0x30020000

The MPU regions are configured and I assume they work ok, as w/o the ZERO copy the driver behaves perfectly.
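Since the placement of buffers and descriptors matters for DMA, a quick host-side check of which RAM block an address falls into can help when reviewing a linker map. The sketch below is only an illustration: the table lists the STM32H743 RAM blocks and sizes as given in ST's reference manual (RM0433), the `eth_dma_ok` flags reflect that the Ethernet DMA cannot reach DTCM, and the helper names `find_region` and `eth_dma_can_reach` are made up for this example. The real limits of a project come from its own linker script and MPU setup.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per on-chip RAM block considered in this sketch. */
typedef struct
{
    const char * name;
    uint32_t base;
    uint32_t size;
    bool eth_dma_ok; /* true if the Ethernet DMA can access this block */
} ram_region_t;

/* Base addresses and sizes per RM0433; verify against your own device. */
static const ram_region_t ram_regions[] =
{
    { "DTCM",     0x20000000u, 128u * 1024u, false },
    { "AXI_SRAM", 0x24000000u, 512u * 1024u, true  },
    { "SRAM1",    0x30000000u, 128u * 1024u, true  },
    { "SRAM2",    0x30020000u, 128u * 1024u, true  },
    { "SRAM3",    0x30040000u,  32u * 1024u, true  },
};

/* Return the block that fully contains [start, start + len), or NULL. */
const ram_region_t * find_region( uint32_t start, size_t len )
{
    uint64_t end = ( uint64_t ) start + len;

    for( size_t i = 0; i < sizeof( ram_regions ) / sizeof( ram_regions[ 0 ] ); i++ )
    {
        const ram_region_t * r = &ram_regions[ i ];
        uint64_t region_end = ( uint64_t ) r->base + r->size;

        if( ( start >= r->base ) && ( end <= region_end ) )
        {
            return r;
        }
    }

    return NULL;
}

/* True if the whole range lies in one block the Ethernet DMA can reach. */
bool eth_dma_can_reach( uint32_t start, size_t len )
{
    const ram_region_t * r = find_region( start, len );

    return ( r != NULL ) && r->eth_dma_ok;
}
```

For example, `eth_dma_can_reach( 0x30040000u, 1400u )` reports whether a 1400-byte payload at the start of SRAM3 is usable. A range that straddles two adjacent blocks is reported as not contained, which is deliberately conservative.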

Do you have any clue where my problem could come from?

Thanks

Which Ethernet driver do you use?
There is one known to work well here

Hello, I have just compared the code with the following outcome:

  • The ETH driver is exactly the same as mine (stm32hxx_hal_eth.c/.h)
  • My version of the NetworkInterface.c is newer (2.4.0 instead of 2.3.4)

Sounds good if you’re using this driver. Strange that zero copy doesn’t work for you.
This should work since it’s the recommended configuration and for sure tested and used by @htibosch. Maybe he has an idea what might be wrong.

Hola Javier, @hs2,

When I created a testing project I had both zero-copy for transmission and reception:

    #define ipconfigZERO_COPY_RX_DRIVER    ( 1 )
    #define ipconfigZERO_COPY_TX_DRIVER    ( 1 )

It showed a really high performance!

What I do remember is my confusion about all types of memories that do or don’t work together with DMA. I wrote about that in the readme.md.

This is the project that I used for testing. Normally my GitHub repos lag behind the FreeRTOS GitHub, but in this case my release is just a bit newer. Please check this project, and if it doesn’t work for you, you can report back here.

Thanks

Hola Hein,

thanks for your quick reply! I had looked at your code to build my application. The only differences I see are:

1- I use Bufferallocation_2 and you use Bufferallocation_1. I am not deep into FreeRTOS; I will have a look at the implications and test with yours as well.

2- You use revision 2.3.4 of the files, I use 2.4.0. I can test with yours as well, but the code related to the zero-copy driver seems to be exactly the same.

3- You have put ethernet_data in AXI_RAM and I’ve put it in D2 RAM (SRAM2). According to ST, the Ethernet data should be put in D2 RAM, probably to avoid going over the memory bridge. Nevertheless, I can try your configuration as well.

In your application, did you ping mainly or have you tested as well in the past the TCP send function (FreeRTOS_send)?

Thanks,

Javier

For sure he also did performance testing (with iperf).

@arijav wrote:

I use Bufferallocation_2 and you use Bufferallocation_1

Bufferallocation_1 is the one you need because it uses the pre-allocated buffers that you see here.
“_2” calls pvPortMalloc() to allocate network buffers, which is probably the wrong type of RAM.
Within NetworkInterface.c, pxGetNetworkBufferWithDescriptor() is called in two locations, in pucGetRXBuffer() and in prvNetworkInterfaceInput().

You can find a good description of the allocation schemes here.

You use the 2.3.4 revision for the files, I use the 2.4.0

About the versions: that is a bit confusing, the latest version of the network interface can be found in the official repo here.
I just upgraded the driver in my freertos_plus_projects, so now they are the same.

The version notation has been taken out of the source files, and can now be found in FreeRTOS_IP.h only.

You have put ethernet_data in AXI_RAM and I’ve put it in D2 RAM (SRAM2)

Just play with these parameters. I used the maximum possible size in order to do performance testing.

@hs2 wrote:

For sure he also did performance testing (with iperf)

Hartmut knows me, I like iperf3 TCP testing. Not only to test performance, but also the robustness. You find my iperf3 server here.

A second good test is in Linux: sudo ping -f 192.168.1.100, also called flood ping.

EDIT: Please note that all libraries in my freertos_plus_projects repo are out of date. Please use the official ones, either an LTS version or the latest releases.

Hello Hein,

again, thanks for taking the time to help with my issue. I am now using Bufferallocation_1 but I get the same issue: without ZERO copy everything works fine; with it enabled, the communication does not work properly.

2 things I have noticed:

1- I am using heap 4 instead of heap 5 as you do. Do you think this can be problematic? The heap from FreeRTOS and the ethernet_data are in different memory banks.

2- I see with Wireshark that TCP packets are indeed sent but on one hand they do not have the correct length (1460 bytes instead of 1360 bytes) and on the other I get retransmission errors

Do you have any clue if the heap 4 topic could be the issue? I will try to use heap 5 but it takes some rework as I am using the SRAM 3 (0x30040000) in the D2 RAM for the DMA data coming over SAI buses from external ADCs.

Thanks,

Javier

As Hein mentioned, when using Bufferallocation_1 the heap is not involved / used in the Ethernet driver. But the differing segment sizes sound like a promising lead for finding the problem you have.
What’s your ipconfigTCP_MSS in FreeRTOSIPConfig.h? And can you give it a try using the default MSS, which is derived from the default MTU of 1500 (e.g. by commenting out ipconfigTCP_MSS in your FreeRTOSIPConfig.h)?

Hello Hartmut,

in my FreeRTOSIPConfig.h I have not defined ipconfigTCP_MSS, so I assume it takes the default value from FreeRTOSIPConfigDefaults.h:

    #define ipconfigTCP_MSS    ( ipconfigNETWORK_MTU - ( ipSIZE_OF_IPv4_HEADER + ipSIZE_OF_TCP_HEADER ) )

My ipconfigNETWORK_MTU is 1500 and ipSIZE_OF_IPv4_HEADER and ipSIZE_OF_TCP_HEADER are 20 each, so ipconfigTCP_MSS would be 1460
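The arithmetic above can be checked with a few lines of host-side C. This is only a sketch of the default formula quoted from the configuration defaults: the function name `tcp_mss_for_mtu` and the two macros are made up for this example and are not part of FreeRTOS+TCP.

```c
#include <assert.h>
#include <stdint.h>

/* Header sizes without options, as stated in the thread. */
#define SIZE_OF_IPV4_HEADER    20u
#define SIZE_OF_TCP_HEADER     20u

/* Default MSS for a given MTU: MTU minus the IPv4 and TCP headers. */
uint32_t tcp_mss_for_mtu( uint32_t mtu )
{
    /* The MTU must at least leave room for both headers. */
    assert( mtu > ( SIZE_OF_IPV4_HEADER + SIZE_OF_TCP_HEADER ) );

    return mtu - ( SIZE_OF_IPV4_HEADER + SIZE_OF_TCP_HEADER );
}
```

With this, an MTU of 1500 gives an MSS of 1460, an MTU of 1440 gives 1400, and an MTU of 1400 gives 1360.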

Nevertheless, I understand this as the upper limit, and the length I pass to FreeRTOS_send is 1360 bytes… As mentioned, without ZERO copy it works perfectly. I have also tested with an MTU of 1400 bytes to force ipconfigTCP_MSS to be 1360. While I then see packets in Wireshark with the correct length of 1360, they come sporadically (the rate is minimal), the data is not correct and I see plenty of packet issues (TCP Previous segment not captured, TCP Retransmission, etc.).

If the heap differences are not the issue I am wondering where is the problem…

Thanks,

Javier

So, I have news. It seems the problem was ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS: with a value of 64, the network buffers overflowed the D2 RAM block used for ethernet_data.

I have changed ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS to 32 and now it works. There is still an issue, however, as I get far better speeds without ZERO copy than with it…

I will now check with other memory blocks, such as the AXI_RAM, let’s see what happens
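A size check of this kind can be sketched in a few lines. The numbers below are assumptions for illustration only: 1536 bytes per network buffer is assumed for ETH_RX_BUF_SIZE, and the 64 KB region is a hypothetical linker region, since the thread does not state the actual size reserved for ethernet_data. The helper names `network_pool_bytes` and `pool_fits` are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Total bytes needed for the statically allocated network buffers:
 * one buffer of buffer_size bytes per network buffer descriptor. */
size_t network_pool_bytes( size_t descriptors, size_t buffer_size )
{
    return descriptors * buffer_size;
}

/* True if the buffer pool fits into a RAM region of region_bytes bytes. */
bool pool_fits( size_t descriptors, size_t buffer_size, size_t region_bytes )
{
    return network_pool_bytes( descriptors, buffer_size ) <= region_bytes;
}
```

Under these assumptions, 64 descriptors need 98304 bytes (96 KB) and would not fit into a 64 KB region, while 32 descriptors need 49152 bytes (48 KB) and would. On the target, the same comparison could be placed in a C11 `static_assert` so that an oversized pool fails at build time instead of at run time.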

Ok, issue solved. Now ZERO copy is running and is slightly faster than without it.

Thanks a lot to you Hartmut and to Hein for your valuable support!

Great news Javier … although I’d expect noticeably better performance compared to the copy mode.
There might be some possibilities to tweak the stack in case it’s still too slow.
Also searching the forum I found e.g. this benchmark and also others.

One last question Hein and Hartmut. I think, even if working, I might be doing something inefficient from the memory perspective in my application.

I am buffering the payload of the packets (80 packets in total, each payload 1400 bytes) in a big array in D2 RAM. In each call I pass the address of the payload to be sent to FreeRTOS_send. ZERO copy is enabled, so I assume the payload will not be copied to the ucNetworkPackets array. I see however that the array ucNetworkPackets, which is now defined in AXI_RAM, is quite big: ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS * ETH_RX_BUF_SIZE, so it would again have space for the whole payloads + headers.

Because of ZERO copy I have the feeling that I am using twice as much RAM as necessary. Am I doing something wrong? Why is ucNetworkPackets that big if the payloads are not copied anyway (ZERO copy)?

To understand better why I am buffering in D2 RAM. The buffer is being filled up by DMA interrupts from the SAI buses. This happens very fast, so if I do not buffer there by the time the FreeRTOS Tasks (IP as well as the one I use to call FreeRTOS_send) run I might have already lost the payload by being overwritten by the next batch of interrupts. Therefore I need to buffer there for sure first.
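The buffering described here is essentially a single-producer, single-consumer ring of packet-sized slots: the DMA interrupt fills and commits slots, and the sending task drains them. The following is a minimal host-side sketch under stated assumptions, not the code used in the project. All names (`packet_ring_t`, `adc_ring`, `ring_write_slot`, and so on) are hypothetical, and the sketch leaves out what a real Cortex-M7 target would also need, such as placing the ring in a DMA-reachable section and handling cache maintenance and memory barriers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT    80u   /* from the thread: 80 packets are buffered */
#define SLOT_BYTES    1400u /* from the thread: 1400-byte payloads */

typedef struct
{
    uint8_t slots[ SLOT_COUNT ][ SLOT_BYTES ];
    volatile uint32_t write_index; /* advanced only by the producer (ISR) */
    volatile uint32_t read_index;  /* advanced only by the consumer (task) */
} packet_ring_t;

/* Zero-initialised, so the ring starts out empty. */
packet_ring_t adc_ring;

static uint32_t next_index( uint32_t i )
{
    return ( i + 1u ) % SLOT_COUNT;
}

/* Producer: slot to fill next, or NULL if the ring is full.
 * One slot always stays unused, so SLOT_COUNT - 1 slots are usable. */
uint8_t * ring_write_slot( packet_ring_t * r )
{
    if( next_index( r->write_index ) == r->read_index )
    {
        return NULL;
    }

    return r->slots[ r->write_index ];
}

/* Producer: publish the slot returned by ring_write_slot(). */
void ring_commit_write( packet_ring_t * r )
{
    r->write_index = next_index( r->write_index );
}

/* Consumer: oldest filled slot, or NULL if the ring is empty. */
const uint8_t * ring_read_slot( const packet_ring_t * r )
{
    if( r->read_index == r->write_index )
    {
        return NULL;
    }

    return r->slots[ r->read_index ];
}

/* Consumer: hand the slot returned by ring_read_slot() back to the producer. */
void ring_release_read( packet_ring_t * r )
{
    r->read_index = next_index( r->read_index );
}
```

In this scheme the sending task would pass the pointer from `ring_read_slot()` to FreeRTOS_send with a length of `SLOT_BYTES` and call `ring_release_read()` only after the whole slot has been accepted, so the producer cannot overwrite data that is still waiting to be sent.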

Thanks

Indeed, the +TCP library already reserves a lot of RAM for the network buffers.

A standard implementation would involve:

	BaseType_t rc = 0;
	static char pcBuffer[2048];
	do
	{
		rc = read( pcBuffer, sizeof pcBuffer );
		if( rc > 0 )
		{
			rc = FreeRTOS_send( xSocket, pcBuffer, rc, 0 );
		}
	}
	while( rc > 0 );

FreeRTOS_send() will copy the data from pcBuffer to the internal stream buffer of the socket.

Not sure if it makes a big difference, but you can also call read() with a pointer to the socket’s stream buffer.

Here is an example of a FTP server which reads from disk while passing a memory buffer that belongs to the TCP socket.

In this example, the data will sometimes be copied, sometimes it will be passed directly to the stream buffer.
The reason is that the driver only wants to call ff_fread() to read a multiple of 512 bytes.

Note that the use of zero-copy for TCP is a bit complex, and the savings are not really high, a few percent. The above traditional example is much easier to understand.

each payload 1400bytes

Have you looked what MSS is chosen when you send the data? Is that also 1400 bytes?
You could play with either ipconfigTCP_MSS or ipconfigNETWORK_MTU to change the chosen MSS for the TCP connection. It might be more efficient if the two are “in tune”.

In case your data is not “classified”, could you make a PCAP of a TCP session and send that in a ZIP file? Just a fragment of a few MB is enough.

About changing the MSS (Maximum Segment Size) for TCP.

In most cases, ipconfigNETWORK_MTU is defined to 1500 and ipconfigTCP_MSS automatically becomes 1460.

Now if you define ipconfigNETWORK_MTU as 1440, MSS will automatically become 1400.

If not, you can leave MTU to 1500 and just define ipconfigTCP_MSS as 1400.

Also for the receive path network buffers have to be reserved.
This is pretty similar to your SAI data send path. I’m currently unsure how to tune the stack/buffers for very asymmetric traffic like in your case (?) i.e. mainly sending data (and receiving ACKs for it).

Thanks Hein, I think I understand now the situation better. I can limit the MTU to 1440 and gain some RAM, nevertheless the real issue I think, looking at your FTP example, is that I am not really doing TCP Zero copy right now, even if it is active by defines.

In my case it would be a challenge to get the head of the TX buffer (FreeRTOS_get_tx_head) and fill it in real time from within the DMA interrupts, which are very fast. I think I will have to stick to two separate buffers for now. I will however revisit this issue in the future and check whether I can find a way to improve.

Just one question: ucNetworkPackets includes the whole TX packet, right? So payload + header.