FreeRTOS+TCP v4.0.0 performance degradation

We’ve just upgraded from FreeRTOS+TCP v3.1.0 to v4.0.0

We have a task that sends out chunks of data to the connected TCP client whenever FreeRTOS_recv returns 0 (timeout).
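Roughly, the task looks like this (a simplified sketch rather than our exact code; buffer sizes and the chunk contents are placeholders):

    #include "FreeRTOS.h"
    #include "task.h"
    #include "FreeRTOS_Sockets.h"

    /* Simplified sketch of the server task: wait for a request with a receive
     * timeout (set via FREERTOS_SO_RCVTIMEO), and push out the next chunk of
     * data whenever the receive times out. */
    static void prvServerTask( void * pvParameters )
    {
        Socket_t xClient = ( Socket_t ) pvParameters;
        static uint8_t ucRxBuffer[ 128 ];    /* placeholder sizes */
        static uint8_t ucTxChunk[ 1460 ];

        for( ;; )
        {
            BaseType_t xReceived = FreeRTOS_recv( xClient, ucRxBuffer, sizeof( ucRxBuffer ), 0 );

            if( xReceived > 0 )
            {
                /* Handle the received command here. */
            }
            else if( xReceived == 0 )
            {
                /* Receive timeout: send the next chunk of data. */
                ( void ) FreeRTOS_send( xClient, ucTxChunk, sizeof( ucTxChunk ), 0 );
            }
            else
            {
                /* Negative value: the connection was closed or an error occurred. */
                break;
            }
        }

        FreeRTOS_closesocket( xClient );
        vTaskDelete( NULL );
    }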

We notice quite a significant degradation in performance compared to v3.1.0, around 30%.

Is this expected?
Are you aware of anything like this in your tests?

Thanks
Stefano

Hello Zugo - can you share a bit more detail about the hardware you are using and the changes you made after moving from v3.1.0 to v4.0.0? If it is possible to post any sample code that can reproduce the problem, that would be great.

We did not observe any performance degradation. We ran some benchmarking using iperf3 and saw the same or a minor improvement in performance compared to v3.1.0.

CC: @Shub @tony-josi-aws @htibosch @moninom1

Hi Nikhil, thank you for your reply.

FreeRTOS+TCP runs on an STM32H745 MCU.
We are using the NetworkInterface.c file provided in the STM32Hxx folder, with BufferAllocation_1.c and DMA.

We changed from FreeRTOS+TCP v3.1.0 (0bf460c) to v4.0.0 (b41e57e).

To update to v4, we only changed the FreeRTOS_IPInit call to match the new implementation (see the sketch below).
Also, I had to make pxSTM32H_FillInterfaceDescriptor public in NetworkInterface.h.
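For reference, our v4 initialization now looks roughly like this (a simplified sketch; the addresses and MAC are placeholders, not our real configuration):

    #include "FreeRTOS_IP.h"
    #include "NetworkInterface.h"   /* where pxSTM32H_FillInterfaceDescriptor is now declared */

    /* One interface, one IPv4 endpoint; all addresses below are placeholders. */
    static NetworkInterface_t xInterface;
    static NetworkEndPoint_t xEndPoint;

    static const uint8_t ucIPAddress[ 4 ]  = { 192, 168, 1, 10 };
    static const uint8_t ucNetMask[ 4 ]    = { 255, 255, 255, 0 };
    static const uint8_t ucGateway[ 4 ]    = { 192, 168, 1, 1 };
    static const uint8_t ucDNS[ 4 ]        = { 192, 168, 1, 1 };
    static const uint8_t ucMACAddress[ 6 ] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };

    void vStartTCPStack( void )
    {
        /* Fill in the STM32Hxx interface descriptor (EMAC index 0). */
        pxSTM32H_FillInterfaceDescriptor( 0, &xInterface );

        /* Attach a single IPv4 endpoint to that interface. */
        FreeRTOS_FillEndPoint( &xInterface, &xEndPoint,
                               ucIPAddress, ucNetMask, ucGateway, ucDNS, ucMACAddress );

        /* Start the stack with the interface/endpoint list built above. */
        FreeRTOS_IPInit_Multi();
    }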

Our system implements a TCP Server with only 1 possible client connection.
With the changes described above, the system works the same as with v3.1.0, but FreeRTOS_send now takes 600 µs to complete, while it took 10 µs with v3.1.0.

FreeRTOSIPConfig.h (19.0 KB)

Do you have an example of v4.0.0 working with an STM32H7 and DMA?

Side note: NetworkInterface.c for STM32Hxx has the wrong release version in its header, since it still says “FreeRTOS+TCP V2.3.2”.

Thanks!
Stefano

Between v3 and v4, the TCP part of the FreeRTOS+TCP library hasn't changed in any way that we would expect to have a significant impact on performance.

Is there a difference in the application used to test these two versions? How many endpoints are initialized when testing with the newer version, given that the older one didn't have multiple-endpoint support? Are you using IPv6?

Hi Tony,

We’re not using IPv6.
Only 1 endpoint is initialized. The size of the endpoint array is 1, and so is the interface array.
The FreeRTOS_send execution time is measured by setting a GPIO before the call and resetting it after, then measuring the pulse width with a logic analyzer.
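In code it is roughly this (simplified; the GPIO port/pin are just placeholders for the spare debug pin we use):

    #include "FreeRTOS_Sockets.h"
    #include "stm32h7xx_hal.h"

    /* Toggle a spare GPIO around the send call so its duration shows up on
     * the logic analyzer (port/pin are placeholders). */
    static BaseType_t xTimedSend( Socket_t xSocket, const void * pvData, size_t uxLength )
    {
        BaseType_t xResult;

        HAL_GPIO_WritePin( GPIOB, GPIO_PIN_0, GPIO_PIN_SET );     /* marker HIGH */
        xResult = FreeRTOS_send( xSocket, pvData, uxLength, 0 );
        HAL_GPIO_WritePin( GPIOB, GPIO_PIN_0, GPIO_PIN_RESET );   /* marker LOW */

        return xResult;
    }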

I’ll try to narrow down where all that CPU time is used.

Here is a sample project that is tested on an STM32 Nucleo H723ZG. Note that this is a sample project (IPv4/IPv6, multi-endpoint) and has not been performance tested.

@zugo83 wrote:

but now FreeRTOS_send takes 600 µs to complete while it took 10 µs with v3.1.0.

Just curious how the execution time is calculated in the application; wondering if a context switch to another task happened in between while the time was being measured.

@zugo83 wrote:

Are you aware of anything like this in your tests?

There can be a small difference in efficiency, because the checks on security and correctness have become stricter.

Today I tested TCP transmission on my STM32H747 using iperf3. The CPU is running at 400 MHz. Iperf gets loads of buffer space:

    #define ipconfigIPERF_TX_BUFSIZE (24 * ipconfigTCP_MSS)
    #define ipconfigIPERF_TX_WINSIZE (12)
    #define ipconfigIPERF_RX_BUFSIZE (24 * ipconfigTCP_MSS)
    #define ipconfigIPERF_RX_WINSIZE (12)

The STM32H was sending either 2920, 5840, or 8760 bytes during each call to send().

Sending 5840 bytes takes an average of 425 µs, which comes close to your 600 µs. The throughput is perfect though:

tibosch@laptop-hp:~$ iperf3 -c 192.168.2.107 --port 5001 --bytes 100M -R
Connecting to host 192.168.2.107, port 5001
Reverse mode, remote host 192.168.2.107 is sending
[  4] local 192.168.2.11 port 52503 connected to 192.168.2.107 port 5001
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  9.96 MBytes  83.6 Mbits/sec
[  4]   1.00-2.00   sec  11.0 MBytes  92.2 Mbits/sec
[  4]   2.00-3.00   sec  11.0 MBytes  92.2 Mbits/sec
[  4]   3.00-4.00   sec  11.0 MBytes  92.3 Mbits/sec
[  4]   4.00-5.00   sec  11.0 MBytes  92.2 Mbits/sec
[  4]   5.00-6.00   sec  11.0 MBytes  92.2 Mbits/sec
...

( all measured with IPv4 along with the latest +TCP library from github )
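As a rough sanity check on those numbers: 5840 bytes per 425 µs is about 13.7 MB/s, or roughly 110 Mbit/s at the application level, so a blocking time of a few hundred µs per call is still enough to keep a 100 Mbit link full, which matches the ~92 Mbit/s reported by iperf3 above.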

Can you share a PCAP of the DUT performing badly?

Mind you that FreeRTOS_send() is an almost empty function: all it does is wait for space in the circular transmission buffer and pass on the data. While your application is calling FreeRTOS_send(), it is mostly sleeping.
When data has been passed to the stack, a message is sent to the IP-task so it can work on the transmission. Normally the IP-task has a higher priority than the application, which means that the time measured includes the processing done by the IP-task.
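For example, with a typical priority configuration like the following (illustrative values, not necessarily what you have), the GPIO pulse around FreeRTOS_send() also covers the IP-task's processing:

    /* Illustrative priorities only: the IP-task runs above the application
     * task, so it preempts the caller of FreeRTOS_send() as soon as the data
     * has been handed over. */
    #define ipconfigIP_TASK_PRIORITY    ( configMAX_PRIORITIES - 2 )   /* FreeRTOSIPConfig.h default */
    #define appSERVER_TASK_PRIORITY     ( tskIDLE_PRIORITY + 2 )       /* hypothetical application task */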

I have uploaded the project that I just used for performance testing: https://github.com/htibosch/freertos_plus_projects/tree/master/STM32H747_cube_multi

Hi Hein and Tony,

thanks again for your suggestions.

The experiment I’m doing is the following:

I have a Python script that sends a command to get the system time over a TCP/IP connection, 10 times. One reply is 17 bytes.
I have surrounded the FreeRTOS_send call with a GPIO that goes HIGH before the send and LOW after.

The only changes in the FW are the FreeRTOS+TCP version and the few adjustments needed to make it work with the new multi-endpoint initialization (1 endpoint, 1 interface).

I’m attaching the 2 PCAP files (v4 is the bad one).

I also tried removing all other tasks we have in the project, but the issue persists.

Some questions:

  1. Will FreeRTOS_send block until the IP task has completed the transmission?
  2. If I enable run-time stats and call vTaskGetRunTimeStats, would it be possible to check the IP task's CPU time? (See the sketch below.)
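What I have in mind is something like this (a sketch; it assumes configGENERATE_RUN_TIME_STATS, configUSE_TRACE_FACILITY and configUSE_STATS_FORMATTING_FUNCTIONS are enabled, and the buffer size is a guess):

    #include <stdio.h>

    #include "FreeRTOS.h"
    #include "task.h"

    /* Dump the run-time stats table; the "IP-task" row should show how much
     * CPU time the stack itself is using. Requires a run-time counter to be
     * configured (portGET_RUN_TIME_COUNTER_VALUE etc.). */
    static void vPrintRunTimeStats( void )
    {
        static char cStatsBuffer[ 1024 ];   /* size is a guess; depends on the number of tasks */

        vTaskGetRunTimeStats( cStatsBuffer );
        printf( "%s\r\n", cStatsBuffer );
    }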

Tomorrow I’ll try to run the same experiment with your example project.
10-send-v3.zip (966.8 KB)
10-send-v4.zip (351.8 KB)


Hi Hein,

I think I found where the issue comes from:

V3.1.0 has an OR where V4.0.0 has an AND.
Replacing the AND with the OR fixed the FreeRTOS_send execution time.

In my debugging session, I saw that xBytesLeft is 0, but the break is not executed, so execution reaches xEventGroupWaitBits, where it waits for 600 µs.
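Paraphrasing the two versions as I see them in the debugger (so not an exact copy of the library source), the break condition looks roughly like this:

    /* v3.1.0: leave the send loop when either condition holds. */
    if( ( xBytesLeft == 0 ) || ( pvBuffer == NULL ) )
    {
        break;
    }

    /* v4.0.0: both conditions must hold, so a call with a non-NULL buffer
     * and xBytesLeft == 0 does not break here and ends up blocking in
     * xEventGroupWaitBits() instead. */
    if( ( xBytesLeft == 0 ) && ( pvBuffer == NULL ) )
    {
        break;
    }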

Do you know why this was changed in v4, and what the impact would be if I changed it back to the OR condition?


Wow, that sounds like a good diagnostic!

When I was testing the performance with iperf3, I was using the zero-copy method, meaning that my pvBuffer parameter was mostly NULL. That is why I didn't notice any slowness.
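For clarity, zero-copy transmission looks roughly like this (a sketch with minimal error handling; the function and variable names other than the +TCP API calls are made up):

    #include <string.h>

    #include "FreeRTOS_Sockets.h"

    /* Sketch of a zero-copy TCP send: write directly into the socket's TX
     * stream buffer, then call FreeRTOS_send() with a NULL buffer to tell
     * the stack how many bytes were added. */
    static BaseType_t xZeroCopySend( Socket_t xSocket, const uint8_t * pucData, size_t uxLength )
    {
        BaseType_t xSpace = 0;
        uint8_t * pucHead = FreeRTOS_get_tx_head( xSocket, &xSpace );

        if( ( pucHead == NULL ) || ( xSpace < ( BaseType_t ) uxLength ) )
        {
            return 0;   /* not enough room right now; the caller can retry */
        }

        memcpy( pucHead, pucData, uxLength );

        /* pvBuffer == NULL signals that the data is already in the TX buffer. */
        return FreeRTOS_send( xSocket, NULL, uxLength, 0 );
    }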

I think that we should turn this into a PR, and thank you very much for the observation.

I’m quick today and I created a Pull Request for this: PR #1043.
Please comment on the PR if you like, and please confirm that the change brings the performance back to where it was.

Thanks,
Hein


Thanks for making the PR.

One more question:
why is the buffer NULL when using the zero-copy method?
I was thinking that I was also using zero-copy, but my buffer is not NULL.

Cheers
Stefano