STM32H743 FreeRTOS+TCP issue with ZERO Copy

So, last update and probably the final one here. I had to slightly modify the file “FreeRTOS_Sockets.c” so that 100% of my data goes through the TCP buffer on the heap instead of my secondary one. If I set lTxBufSize to my payload size (1400) minus 4, the buffer size was rounded up to a multiple of the MSS; see “FreeRTOS_Sockets.c”, line 1722:

            if( lOptionName == FREERTOS_SO_SNDBUF )
            {
                /* Round up to nearest MSS size */
                ulNewValue = FreeRTOS_round_up( ulNewValue, ( uint32_t ) pxSocket->u.xTCP.usMSS );
                pxSocket->u.xTCP.uxTxStreamSize = ulNewValue;
            }

I have commented out the rounding, and now the TCP TX buffer is always a multiple of my payload size, so I never have to fall back to my secondary buffer (it is still there in case something goes wrong).
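To make the effect of the rounding concrete, here is a minimal sketch of the arithmetic. The helper `ulRoundUp()` is a hypothetical stand-in for the library's round-up helper, and the numbers are assumptions for illustration only: an MSS of 1460 bytes (a common value for IPv4 over Ethernet, the real one depends on the configuration) and a requested buffer of 10 payloads of 1400 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the library's round-up helper:
 * rounds ulValue up to the next multiple of ulMultiple. */
static uint32_t ulRoundUp( uint32_t ulValue, uint32_t ulMultiple )
{
    assert( ulMultiple != 0u );
    return ulMultiple * ( ( ulValue + ulMultiple - 1u ) / ulMultiple );
}

/* Returns 1 when a stream of ulStreamSize bytes holds a whole number of
 * payloads, so a payload never straddles the end of the buffer. */
static int xHoldsWholePayloads( uint32_t ulStreamSize, uint32_t ulPayloadSize )
{
    assert( ulPayloadSize != 0u );
    return ( ulStreamSize % ulPayloadSize ) == 0u;
}
```

With these assumed numbers, a request for 14000 bytes becomes 14600 bytes after rounding, which leaves 600 bytes at the end that cannot hold a full 1400-byte payload. Without the rounding the size stays at 14000 and every payload fits exactly.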

I manage to get 72Mbps TCP stream now and, considering the high load (the 20 µs ISRs), I think that is quite fine for my application. I will eventually do a benchmark against lwIP, but for now I am really satisfied and I will move on to other areas of my application that need improvement.

Thanks for your support!

I manage to get 72Mbps TCP stream now

You make my day! Thank you for your patience. I will check the piece of code that you mention.

Great achievement :+1:
I agree with you that you’re probably on the edge of your system with the pretty high interrupt rate of the data source. I would be surprised if you’d get comparable performance with lwIP.

By the way, I’ve modified FreeRTOS_Sockets.c/.h and added an advanced function that, together with FreeRTOS_get_tx_head(), helps me fully use the ETH buffer on the heap:

/**
 * @brief Get a direct pointer to the first element of the buffer.
 *
 * @param[in] xSocket: The socket owning the buffer.
 *
 * @return First element of the circular transmit buffer if all checks pass. Or else, NULL
 *         is returned.
 */
    uint8_t * FreeRTOS_get_first_buff_element( ConstSocket_t xSocket )
    {
        uint8_t * pucReturn = NULL;
        const FreeRTOS_Socket_t * pxSocket = ( const FreeRTOS_Socket_t * ) xSocket;
        StreamBuffer_t * pxBuffer = NULL;

        /* Confirm that this is a TCP socket before dereferencing structure
         * member pointers. */
        if( prvValidSocket( pxSocket, FREERTOS_IPPROTO_TCP, pdFALSE ) == pdTRUE )
        {
            pxBuffer = pxSocket->u.xTCP.txStream;

            if( pxBuffer != NULL )
            {
                pucReturn = &( pxBuffer->ucArray[ 0 ] );
            }
        }

        return pucReturn;
    }

As I fill the buffer progressively, I start writing my data at “FreeRTOS_get_tx_head() + payload size” and I pass NULL to FreeRTOS_send() so that the payload which is already complete gets sent. When I see that the head is one payload away from the end, I call my function and I start writing data at the beginning of the buffer. FreeRTOS_send() with a NULL argument then sends the previous, already complete payload that lies at the end of the buffer.

Just my 2 cents, might be worth considering getting this into the official code for advanced users. The speeds I get now are crazy (>80Mbps), but then other tasks stop working properly, so I actually have to stay at max 72Mbps for my application to work properly.

As I fill the buffer progressively, I start writing my data at “FreeRTOS_get_tx_head() + payload size”

So you create your own temporary HEAD pointer in the buffer? And I guess that you write packets of 140 bytes each?

I pass NULL to FreeRTOS_send() so that the payload which is already complete gets sent.

So now, without further copying, the IP-task sends like 1400 bytes?

When I see that the head is one payload away from the end, I call my function and I start writing data at the beginning of the buffer. FreeRTOS_send() with a NULL argument then sends the previous, already complete payload that lies at the end of the buffer.

I assume that you found that calling FreeRTOS_send() for each small block (140 bytes) is much slower than calling it for large blocks (of 1400 bytes) only?

The speeds I get now are crazy (>80Mbps)…

Very good, that is also what I found!

Just my 2 cents, might be worth considering getting this into the official code for advanced users.

Yes, I don’t mind adding that function. I would propose some minor (non-functional) changes:

/**
 * @brief Get a pointer to the first element of the TX stream buffer.
 *
 * @param[in] xSocket: The socket owning the buffer.
 *
 * @return First element of the circular transmit buffer if all checks pass. Or else, NULL
 *         is returned.
 */
uint8_t * FreeRTOS_get_tx_base( ConstSocket_t xSocket )
{
    uint8_t * pucReturn = NULL;
    const FreeRTOS_Socket_t * pxSocket = ( const FreeRTOS_Socket_t * ) xSocket;

    /* Confirm that this is a TCP socket before dereferencing structure
     * member pointers. */
    if( prvValidSocket( pxSocket, FREERTOS_IPPROTO_TCP, pdFALSE ) == pdTRUE )
    {
        StreamBuffer_t * pxBuffer = pxSocket->u.xTCP.txStream;

        if( pxBuffer != NULL )
        {
            pucReturn = pxBuffer->ucArray;
        }
    }

    return pucReturn;
}

I create my own temporary head pointer where I write chunks of 140 bytes (the data gathered in one ISR call). I trigger FreeRTOS_send() when the payload is full, i.e. after 10 ISRs = 1400 bytes.

Correct, I trigger FreeRTOS_send() with NULL as the buffer pointer and 1400 as the length. The IP task then sends the payload, which is already complete. At the same time I start constructing my next payload at HEAD + one payload (1400 bytes), or at BASE if my HEAD is one payload away from the end of the buffer.
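For readers who want to picture the scheme, here is a minimal, self-contained sketch of the pointer selection and chunk counting described above. It is not the poster's code: `pucNextPayloadStart()`, `xAddChunk()` and `Staging_t` are hypothetical names, the base and head pointers are passed in as plain arguments (in the real application they would come from FreeRTOS_get_tx_base() and FreeRTOS_get_tx_head()), and the sketch assumes the buffer size is an exact multiple of the payload size. It also ignores whether the stack has already released the region being overwritten.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE      140u    /* bytes gathered per ISR (from the thread) */
#define PAYLOAD_SIZE    1400u   /* bytes committed per send call (from the thread) */

/* Choose where the next payload is built: one payload past the head, or back
 * at the base when the head is one payload away from the end of the buffer. */
static uint8_t * pucNextPayloadStart( uint8_t * pucBase,
                                      uint8_t * pucHead,
                                      size_t uxBufferSize )
{
    size_t uxHeadOffset = ( size_t ) ( pucHead - pucBase );
    size_t uxNextOffset = uxHeadOffset + PAYLOAD_SIZE;

    /* Assumption: the buffer holds a whole number of payloads. */
    assert( ( uxBufferSize % PAYLOAD_SIZE ) == 0u );
    assert( uxHeadOffset < uxBufferSize );

    if( uxNextOffset >= uxBufferSize )
    {
        uxNextOffset = 0u; /* Wrap around to the base of the buffer. */
    }

    return pucBase + uxNextOffset;
}

/* Hypothetical staging state: start of the payload under construction and
 * the number of bytes already written into it. */
typedef struct
{
    uint8_t * pucWrite;
    size_t uxFilled;
} Staging_t;

/* Copy one chunk into the payload under construction. Returns 1 when the
 * payload is complete; the caller would then commit it, e.g. with
 * FreeRTOS_send( xSocket, NULL, PAYLOAD_SIZE, 0 ), and pick a new start. */
static int xAddChunk( Staging_t * pxStaging, const uint8_t * pucChunk )
{
    int xComplete = 0;

    memcpy( pxStaging->pucWrite + pxStaging->uxFilled, pucChunk, CHUNK_SIZE );
    pxStaging->uxFilled += CHUNK_SIZE;

    if( pxStaging->uxFilled == PAYLOAD_SIZE )
    {
        pxStaging->uxFilled = 0u;
        xComplete = 1;
    }

    return xComplete;
}
```

In this sketch the ISR side would call `xAddChunk()` once per interrupt; on the tenth call it reports a complete payload, at which point the task commits it and uses `pucNextPayloadStart()` to decide where the following payload goes.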

Thanks for the changes, I have taken them over into my code as well

Hi Hein,

I’m also evaluating performance with iperf3 on an STM32H757 using your code.

Would you know why I get so many retries when I run the test in reverse mode?
Do you get the same?

Thank you for the amazing work you’re doing!

Stefano

Reverse mode, remote host 10.41.16.253 is sending
[  4] local 10.41.16.29 port 52098 connected to 10.41.16.253 port 5001
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  11.1 MBytes  93.0 Mbits/sec
[  4]   1.00-2.00   sec  11.0 MBytes  92.1 Mbits/sec
[  4]   2.00-3.00   sec  11.0 MBytes  92.2 Mbits/sec
[  4]   3.00-4.00   sec  11.0 MBytes  92.1 Mbits/sec
[  4]   4.00-5.00   sec  11.0 MBytes  92.1 Mbits/sec
[  4]   5.00-6.00   sec  11.0 MBytes  92.2 Mbits/sec
[  4]   6.00-7.00   sec  11.0 MBytes  92.2 Mbits/sec
[  4]   7.00-8.00   sec  11.0 MBytes  92.2 Mbits/sec
[  4]   8.00-9.00   sec  11.0 MBytes  92.1 Mbits/sec
[  4]   9.00-9.09   sec  1.03 MBytes  91.8 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-9.09   sec  37.0 Bytes  32.5 bits/sec  4294967295             sender
[  4]   0.00-9.09   sec   100 MBytes  92.2 Mbits/sec                  receiver

iperf Done.

C:\Users\ugo64915\Downloads\iperf-3.1.3-win64>

Hi Stefano,

why I get so many retries

Do you mean “Retr 4294967295” ?

The number looks like (a 32-bit) -1. I don’t know what it stands for.
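As a quick check of that reading, the following sketch shows that a signed 32-bit -1 reinterpreted as unsigned is exactly 4294967295. The helper name is hypothetical, and whether iperf really stores and prints the value this way is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a signed 32-bit value as unsigned, which is what a printout
 * using an unsigned format would effectively show. */
static uint32_t ulAsUnsigned32( int32_t lValue )
{
    return ( uint32_t ) lValue;
}
```

Calling `ulAsUnsigned32( -1 )` gives `UINT32_MAX`, i.e. 4294967295, the same number as in the Retr column.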

When I look at a PCAP of an iperf connection, I don’t see any retries.

Your iperf results are even better than what I got here.
Thanks

Hi Hein,

Yes, by retry I meant Retr in the iperf printout. In your previous example it was omitted, but I expect it was the same.

With a little more digging, it looks like it is caused by the partial implementation of iperf.

In iperf_task_v3_0d.c

			ulLength = snprintf( pcResponse + 4, sizeof( pcResponse ) - 4,
				"{"
					"\"cpu_util_total\":0,"
					"\"cpu_util_user\":0,"
					"\"cpu_util_system\":0,"
					"\"sender_has_retransmits\":-1,"
					"\"streams\":["
						"{"
							"\"id\":1,"
							"\"bytes\":%lu,"
							"\"retransmits\":-1,"
							"\"jitter\":0,"
							"\"errors\":0,"
							"\"packets\":0"
						"}"
					"]"
				"}\xe",
				ulCount );

The response sent back to iperf has hardcoded values:
“"sender_has_retransmits":-1,”
“"retransmits":-1,”

This is why iperf on the PC prints Retr 4294967295.
I guess it can be ignored if the PCAP doesn’t show retries.

Changing

"\"sender_has_retransmits\":-1,"

to

"\"sender_has_retransmits\":0,"

would stop the Retr value from being printed.

Thank you
Cheers

Very good of you, I will change the code accordingly.

As you see, I never implemented the 'retransmits' feature.
I analysed the protocol by running two iperf instances talking to each other. There I saw the JSON expression
sender_has_retransmits:-1, which I just copied.

“Retr” is only printed when the -R option is used: the embedded server sends data.

Thanks

I created PR #544 which adds the new function FreeRTOS_get_tx_base().

Would you be able to check it?

I tested it as follows:

{
    BaseType_t xLength;
    uint8_t * pucBase = FreeRTOS_get_tx_base( xSocket );
    uint8_t * pucHead = FreeRTOS_get_tx_head( xSocket, &( xLength ) );
    FreeRTOS_printf( ( "httpTest: TX base = %p head = %p diff %u\n",
                       ( void * ) pucBase,
                       ( void * ) pucHead,
                       ( unsigned ) ( pucHead - pucBase ) ) );
}

When called before the first FreeRTOS_send(), these functions will create the TX buffer.
They will return NULL if there is not enough heap.

Thanks Hein, I will integrate and test it. It will take some time as I am currently solving some other topics in my application, but I will certainly do it and give feedback

Hi Hein, sorry for giving my feedback so late. The modifications work well. Because I fill the FreeRTOS TCP buffer directly, I just had to make one change for my application in FreeRTOS_Sockets.c (line 1760):

            /* Round up to nearest MSS size */
            /* Modification */
            //ulNewValue = FreeRTOS_round_up( ulNewValue, ( uint32_t ) pxSocket->u.xTCP.usMSS );
            pxSocket->u.xTCP.uxTxStreamSize = ulNewValue;

Rounding up does not work for me.

BR,

Javier

btw, I believe this has not been mentioned here, so for completeness’ sake: on the Cortex-M, accessing internal memory is dramatically faster than accessing external memory, which the driver apparently takes into consideration. Anyone who relocates the dynamic heap to external memory may experience a drastic performance degradation.

If you are looking for more ways to fine-tune your system, you may want to inspect your map file to see whether any data accessed by the buffer chain resides in external memory and, if so, find a way to relocate it.

Along the same lines, you may also want to experiment with speed optimization if you have ample flash left (needless to say, this will not affect anything transferred by DMA). Speed-optimized code can bloat to a multiple of the size of size-optimized code, but it may yield throughput improvements of a comparable factor, even more so when run from internal flash.
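As an illustration of keeping the heap in internal memory, here is a minimal sketch assuming a GCC-style toolchain and a heap scheme such as heap_4.c built with configAPPLICATION_ALLOCATED_HEAP set to 1, in which case the application provides the `ucHeap` array itself. The section name `.internal_ram_heap` is hypothetical and would have to be defined in your linker script and mapped to an internal RAM region that the Ethernet DMA can reach; the heap size is a placeholder as well. Whether the projects in this thread use that heap scheme is not stated, so treat this as one possible approach.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: normally this comes from FreeRTOSConfig.h. It is defined here
 * only so that the sketch stands alone. */
#ifndef configTOTAL_HEAP_SIZE
    #define configTOTAL_HEAP_SIZE    ( 64u * 1024u )
#endif

/* Application-provided heap, placed in a named section. The section name is
 * hypothetical: the linker script must map it to internal RAM. */
uint8_t ucHeap[ configTOTAL_HEAP_SIZE ]
    __attribute__( ( section( ".internal_ram_heap" ), aligned( 8 ) ) );
```

After linking, the map file should show `ucHeap` at an address inside the internal RAM region, which is the check suggested above.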