Hello Zugo - can you share a bit more detail about the hardware you are using and the changes you made after moving from v3.1.0 to v4.0.0? If you can post any sample code that reproduces the problem, that would be great.
We did not observe any performance degradation. We ran some benchmarks with iperf3 and saw the same or slightly better performance compared to v3.1.0.
FreeRTOS+TCP runs on an STM32H745 MCU.
We are using the NetworkInterface.c file provided in the STM32Hxx folder, together with BufferAllocation_1.c and DMA.
We changed from FreeRTOS+TCP v3.1.0 (0bf460c) to v4.0.0 (b41e57e).
To update to v4 we only changed the FreeRTOS_IPInit() call to match the new implementation.
I also had to make pxSTM32H_FillInterfaceDescriptor() public in NetworkInterface.h.
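For reference, the v4 initialisation now looks roughly like this; the address values below are placeholders, not our real configuration:

```c
#include "FreeRTOS.h"
#include "FreeRTOS_IP.h"
#include "FreeRTOS_Routing.h"

/* Single interface and single endpoint, as described above. */
static NetworkInterface_t xInterface;
static NetworkEndPoint_t xEndPoint;

/* Made public in NetworkInterface.h (STM32Hxx port). */
extern NetworkInterface_t * pxSTM32H_FillInterfaceDescriptor( BaseType_t xEMACIndex,
                                                              NetworkInterface_t * pxInterface );

void vInitialiseNetwork( void )
{
    /* Placeholder addresses. */
    static const uint8_t ucIPAddress[ 4 ]  = { 192, 168, 2, 107 };
    static const uint8_t ucNetMask[ 4 ]    = { 255, 255, 255, 0 };
    static const uint8_t ucGateway[ 4 ]    = { 192, 168, 2, 1 };
    static const uint8_t ucDNS[ 4 ]        = { 192, 168, 2, 1 };
    static const uint8_t ucMACAddress[ 6 ] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };

    /* Describe the one EMAC of the STM32H7xx. */
    pxSTM32H_FillInterfaceDescriptor( 0, &xInterface );

    /* Attach a single IPv4 endpoint to that interface. */
    FreeRTOS_FillEndPoint( &xInterface, &xEndPoint, ucIPAddress,
                           ucNetMask, ucGateway, ucDNS, ucMACAddress );

    /* Replaces the old FreeRTOS_IPInit() call from v3.1.0. */
    FreeRTOS_IPInit_Multi();
}
```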
Our system implements a TCP Server with only 1 possible client connection.
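In outline the server side is the usual socket/bind/listen/accept sequence with a backlog of 1; the port number and client handling are placeholders:

```c
#include "FreeRTOS.h"
#include "FreeRTOS_IP.h"
#include "FreeRTOS_Sockets.h"

void vServerTask( void * pvParameters )
{
    struct freertos_sockaddr xBindAddress = { 0 };
    struct freertos_sockaddr xClient;
    socklen_t xClientSize = sizeof( xClient );
    Socket_t xListeningSocket, xConnectedSocket;

    ( void ) pvParameters;

    xListeningSocket = FreeRTOS_socket( FREERTOS_AF_INET,
                                        FREERTOS_SOCK_STREAM,
                                        FREERTOS_IPPROTO_TCP );

    xBindAddress.sin_port = FreeRTOS_htons( 5001 ); /* placeholder port */
    FreeRTOS_bind( xListeningSocket, &xBindAddress, sizeof( xBindAddress ) );

    /* Backlog of 1: only a single client connection is accepted. */
    FreeRTOS_listen( xListeningSocket, 1 );

    for( ;; )
    {
        xConnectedSocket = FreeRTOS_accept( xListeningSocket, &xClient, &xClientSize );
        /* ... serve the client, then FreeRTOS_shutdown() /
         * FreeRTOS_closesocket() ... */
    }
}
```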
With the changes described above, the system works the same as with v3.1.0, but FreeRTOS_send() now takes 600 µs to complete while it took 10 µs with v3.1.0.
Comparing v3 and v4, the TCP part of the FreeRTOS+TCP library hasn't changed much in ways that could have a significant impact on performance.
Is there a difference in the application used to test these two versions? How many endpoints are initialized when testing with the newer version, given that the older one didn't have multiple-endpoint support? Are you using IPv6?
We’re not using IPv6.
Only 1 endpoint is initialized. The endpoint array has size 1, and so does the interface array.
The FreeRTOS_send() execution time is measured by setting a GPIO before the call and resetting it after, and measuring the pulse width with a logic analyzer.
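In outline (GPIOB pin 0 stands in for whatever pin the analyzer actually probes):

```c
#include "FreeRTOS.h"
#include "FreeRTOS_Sockets.h"
#include "stm32h7xx_hal.h"

/* Send while marking the call duration on a GPIO for the logic analyzer. */
BaseType_t xTimedSend( Socket_t xSocket, const uint8_t * pucData, size_t uxLength )
{
    BaseType_t lSent;

    HAL_GPIO_WritePin( GPIOB, GPIO_PIN_0, GPIO_PIN_SET );   /* marker HIGH */
    lSent = FreeRTOS_send( xSocket, pucData, uxLength, 0 );
    HAL_GPIO_WritePin( GPIOB, GPIO_PIN_0, GPIO_PIN_RESET ); /* marker LOW  */

    return lSent;
}
```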
I’ll try to narrow down where all that CPU time is used.
Here is a sample project that is tested on an STM32 Nucleo H723ZG. Note that it is a sample project (IPv4/IPv6, multi-endpoint) and has not been performance tested.
but FreeRTOS_send() now takes 600 µs to complete while it took 10 µs with v3.1.0.
Just curious how the execution time is measured in the application - wondering whether a context switch to another task happened in between while the time was being measured.
The STM32H was sending either 2920, 5840, or 8760 bytes during each call to send().
Sending 5840 bytes takes an average of 425 µs, which comes close to your 600 µs. The throughput is perfect though:
tibosch@laptop-hp:~$ iperf3 -c 192.168.2.107 --port 5001 --bytes 100M -R
Connecting to host 192.168.2.107, port 5001
Reverse mode, remote host 192.168.2.107 is sending
[ 4] local 192.168.2.11 port 52503 connected to 192.168.2.107 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 9.96 MBytes 83.6 Mbits/sec
[ 4] 1.00-2.00 sec 11.0 MBytes 92.2 Mbits/sec
[ 4] 2.00-3.00 sec 11.0 MBytes 92.2 Mbits/sec
[ 4] 3.00-4.00 sec 11.0 MBytes 92.3 Mbits/sec
[ 4] 4.00-5.00 sec 11.0 MBytes 92.2 Mbits/sec
[ 4] 5.00-6.00 sec 11.0 MBytes 92.2 Mbits/sec
...
(all measured with IPv4, using the latest +TCP library from GitHub)
Can you share a PCAP of the DUT performing badly?
Mind you that FreeRTOS_send() is an almost empty function: all it does is wait for space in the circular transmission buffer and pass on the data. While your application is calling FreeRTOS_send(), it is mostly sleeping.
When data has been passed to the stack, a message is sent to the IP-task so it can work on the transmission. Normally the IP-task has a higher priority than the application, which means that the time measured includes the processing done by the IP-task.
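If you want to bound that sleeping time, you can set a send timeout with FREERTOS_SO_SNDTIMEO, and FreeRTOS_tx_space() tells you how many bytes can be queued without blocking. A sketch, assuming an open socket xSocket:

```c
#include "FreeRTOS.h"
#include "FreeRTOS_Sockets.h"

/* Bound the time FreeRTOS_send() may sleep, and peek at the free space. */
void vConfigureSendTimeout( Socket_t xSocket )
{
    TickType_t xSendTimeOut = pdMS_TO_TICKS( 100 ); /* placeholder value */

    FreeRTOS_setsockopt( xSocket, 0, FREERTOS_SO_SNDTIMEO,
                         &xSendTimeOut, sizeof( xSendTimeOut ) );

    /* Bytes that could be queued right now without blocking. */
    BaseType_t xSpace = FreeRTOS_tx_space( xSocket );
    ( void ) xSpace;
}
```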
I have a Python script that sends a command to get the system time over a TCP/IP connection, 10 times. Each reply is 17 bytes.
I have surrounded the FreeRTOS_send() call with a GPIO that goes HIGH before the send and LOW after it.
The only changes in the firmware are the FreeRTOS+TCP version and the few adjustments needed to make it work with the new multi-endpoint initialization (1 endpoint, 1 interface).
I’m attaching the 2 PCAP files (v4 is the bad one).
I also tried removing all the other tasks we have in the project, but the issue persists.
Some questions:
Will FreeRTOS_send() block until the IP task has completed the transmission?
If I enable vTaskGetRunTimeStats(), would it be possible to check the IP task's CPU time? (See the sketch below.)
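Something like this sketch is what I have in mind, assuming the run-time-stats hooks are configured:

```c
#include <stdio.h>
#include "FreeRTOS.h"
#include "task.h"

/* Assumes configGENERATE_RUN_TIME_STATS, configUSE_STATS_FORMATTING_FUNCTIONS
 * and configUSE_TRACE_FACILITY are set to 1 in FreeRTOSConfig.h, with a fast
 * counter provided through portCONFIGURE_TIMER_FOR_RUN_TIME_STATS() /
 * portGET_RUN_TIME_COUNTER_VALUE(). */
static char pcStatsBuffer[ 1024 ];

void vPrintRunTimeStats( void )
{
    vTaskGetRunTimeStats( pcStatsBuffer );
    printf( "%s\n", pcStatsBuffer ); /* look for the "IP-task" row */
}
```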
Tomorrow I'll try to run the same experiment with your example project.
10-send-v3.zip (966.8 KB) 10-send-v4.zip (351.8 KB)
When I was testing the performance with iperf3, I was using the zero-copy method, meaning that my pvBuffer parameter was mostly NULL. That is why I didn't notice any slowness.
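For reference, the zero-copy send path looks roughly like this (the data and length names are placeholders):

```c
#include <string.h>
#include "FreeRTOS.h"
#include "FreeRTOS_Sockets.h"

/* Zero-copy send, in outline: write straight into the socket's TX stream
 * buffer, then pass NULL to FreeRTOS_send() so the stack skips its own copy. */
void vZeroCopySend( Socket_t xSocket, const uint8_t * pucData, size_t uxLength )
{
    BaseType_t xSpace;
    uint8_t * pucHead = FreeRTOS_get_tx_head( xSocket, &xSpace );

    if( ( pucHead != NULL ) && ( xSpace >= ( BaseType_t ) uxLength ) )
    {
        memcpy( pucHead, pucData, uxLength );        /* fill buffer in place  */
        FreeRTOS_send( xSocket, NULL, uxLength, 0 ); /* commit, no extra copy */
    }
}
```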
I think that we should turn this into a PR, and thank you very much for the observation.
I’m quick today and I created a Pull Request for this: PR #1043.
Please comment if you like, and please confirm that the change restores the performance.