TCP w/ TX Zero Copy

When using the TX zero copy interface, is there a way to force the TX stream buffer to reset or wrap around before the data reaches the end of the buffer?

I’m trying to modify a Modbus TCP application to use the zero-copy interface. In Modbus, the TX length is variable, anywhere from about 9 bytes up to 260 bytes. Before starting response generation I check that the buffer I’ve been given to put the response in is at least 260 bytes long. This works initially, but after a few dozen responses, the TX stream buffer’s write pointer gets within 260 bytes of the buffer end and my application errors out.
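
Roughly, the check looks like this (a simplified sketch; FreeRTOS_get_tx_head() returns a pointer into the TX stream buffer plus the contiguous space available at that position):

    BaseType_t xTxBufLen;

    /* Ask the stack for a pointer into the TX stream buffer, together
     * with the contiguous space available at that position. */
    uint8_t * pTxBuf = FreeRTOS_get_tx_head( xConnectSocket, &xTxBufLen );

    if( ( pTxBuf == NULL ) || ( xTxBufLen < configMODBUS_TCP_MAX_ADU_LEN ) )
    {
        /* Not enough contiguous space left before the end of the ring
         * buffer: this is where my application currently errors out. */
    }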

In the copy-based method, the TCP stack can split the data and handle the stream buffer wrapping back around, but my application can’t do that. I think it would be easiest for my application if I could reset the stream buffer back to the beginning instead of having to split the response into two parts.
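
For contrast, the copy-based call is a single send (pucResponse and uxResponseLength are placeholder names):

    /* Copy method: FreeRTOS_send() memcpy's the data into the TX stream
     * buffer and can split it across the wrap internally. */
    FreeRTOS_send( xConnectSocket, pucResponse, uxResponseLength, 0 );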

I realize this might take some coordination with the task that’s reading data out of the stream buffer, but is this already possible with the existing stack? Or is this perhaps a common enough issue that it’s worth me trying to add it?

I used the TCP zero-copy methods in the FTP server, which can be found here.

I wanted to retrieve blocks from a file only in lengths that are a multiple of 512 bytes.

You can find my code here.

It works as follows (a sketch of the decision follows the list):

  • If the TX space is 512 or more, read from disk using the TX stream buffer
  • Otherwise, use a temporary buffer pcFILE_BUFFER[], whose size is a multiple of 512 bytes, and use the copy method.
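
In pseudo-form, the decision looks like this (a sketch only, not the actual FTP source; pxFile is assumed to be an open FF_FILE * from FreeRTOS+FAT, read with ff_fread()):

    BaseType_t xLength;
    uint8_t * pucBuffer = FreeRTOS_get_tx_head( xSocket, &xLength );

    if( ( pucBuffer != NULL ) && ( xLength >= 512 ) )
    {
        /* Zero-copy: read straight into the TX stream buffer, rounded
         * down to a multiple of 512, then commit with a NULL pointer. */
        size_t uxCount = ( ( size_t ) xLength / 512U ) * 512U;
        uxCount = ff_fread( pucBuffer, 1U, uxCount, pxFile );
        FreeRTOS_send( xSocket, NULL, uxCount, 0 );
    }
    else
    {
        /* Copy method: read into the temporary buffer and let
         * FreeRTOS_send() copy it into the stream buffer. */
        size_t uxCount = ff_fread( pcFILE_BUFFER, 1U, sizeof( pcFILE_BUFFER ), pxFile );
        FreeRTOS_send( xSocket, pcFILE_BUFFER, uxCount, 0 );
    }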

I found that FTP uses zero-copy transfers most of the time. Only exceptionally will it use the temporary buffer pcFILE_BUFFER[].

If that answers your question, we’re done now.

If you insist on using “TX Zero Copy” at all times, consider the following:

  • The IP task is the reader of the TX stream buffer.
  • Your application is the writer of the TX stream buffer.

Suppose that your application sees that it cannot write an entire packet because the free space is less than 260 bytes. And suppose that all bytes have been sent, so:

    uxHead == uxTail == uxMid

I think that in this case you can call:

    vTaskSuspendAll();
    vStreamBufferClear( pxSocket->u.xTCP.txStream );
    xTaskResumeAll();

and after this call, you can use the full TX space again.

Please let us know what direction you will take.

Thanks for the reply!

I played around with both methods, and I think the full time zero copy fits my goals best.

To test the interfaces, I timed a sequence of 10,000 Modbus queries from Python, each test with a different TX interface: Full Copy, Part-Time Zero Copy, and Full-Time Zero Copy. I also checked the latency using Wireshark; I used the time between when the device ACK’d the query packet and when it sent the response packet. Each response was 257 bytes long. Zero copy was used for receive in all tests.

Type        Rate (rqsts/sec)  Latency (ms)

Full Copy   600               1.0
Part ZC     550               1.5
Full ZC     550               1.5

The test script didn’t report lost or invalid responses with any of the methods.

The goals of moving to zero copy were to a) learn a bit more about how it works, and b) remove the need for allocating extra buffer space. I don’t like the performance hit that comes with it, but since it lets me completely remove the additional buffer, I’ll go for the full-time zero copy. I appreciate the help!


Hi @michaelyoyo ,
Thanks for sharing the results.
As in zero copy the application has to populate the stream buffer directly, the stream buffer size and the number of stream buffers do make a difference to CPU load and performance. For example, a smaller buffer size means the application has to fill the buffer more often, and hence the TX task takes more CPU cycles. It’s always good to experiment a bit and arrive at the optimal combination of stream buffer size and number of buffers.
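
For reference, the per-socket sizes can be tuned before the connection is established using FREERTOS_SO_WIN_PROPERTIES; a minimal sketch (the values shown are only placeholders to experiment with):

    WinProperties_t xWinProps;

    memset( &xWinProps, 0, sizeof( xWinProps ) );
    xWinProps.lTxBufSize = 4 * ipconfigTCP_MSS; /* TX stream buffer size, in bytes. */
    xWinProps.lTxWinSize = 2;                   /* TX window size, in MSS units. */
    xWinProps.lRxBufSize = 4 * ipconfigTCP_MSS; /* RX stream buffer size, in bytes. */
    xWinProps.lRxWinSize = 2;                   /* RX window size, in MSS units. */

    FreeRTOS_setsockopt( xSocket,
                         0,
                         FREERTOS_SO_WIN_PROPERTIES,
                         ( void * ) &xWinProps,
                         sizeof( xWinProps ) );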

Thanks for reporting back, but I’m not sure if I understand the results in your table:

Type        Rate (rqsts/sec)  Latency (ms)

Full Copy   600               1.0
Part ZC     550               1.5
Full ZC     550               1.5

Does Full Copy mean the normal way of sending data, where FreeRTOS_send() calls memcpy()?
Does the table say that (partial or full) zero-copy is slower than the old copy-method?

Would it be possible to show the code that calls FreeRTOS_send()?

That’s correct, the zero-copy method was slower than the standard method. I didn’t look into it much to figure out why, but here’s the loop as it was in the ‘full zc’ test.

    while(1) {
        BaseType_t rxLen = FreeRTOS_recv(xConnectSocket, (void*)&pRxBuf, configMODBUS_TCP_MAX_ADU_LEN, FREERTOS_ZERO_COPY);

        if (rxLen > 0) {
            FreeRTOS_recv(xConnectSocket, (void*)NULL, rxLen, 0);
            pTxBuf = FreeRTOS_get_tx_head(xConnectSocket, &xTxBufLen);

            if (pTxBuf == NULL) {
                continue;
            }

            // pTxBuf is a ring buffer, when we get close to the end of it,
            // we may not have enough space to fit a complete response.
            // Since modbus isn't able to break up the response, we wait until
            // all data is sent (the buffer tail catches up to its head)
            // and then reset the buffer to the beginning
            while (xTxBufLen < configMODBUS_TCP_MAX_ADU_LEN) {
                StreamBuffer_t* pTxStreamBuf = xConnectSocket->u.xTCP.txStream;
                if ((pTxStreamBuf->uxTail == pTxStreamBuf->uxMid)
                    && (pTxStreamBuf->uxMid == pTxStreamBuf->uxHead)) {

                    vTaskSuspendAll();
                    vStreamBufferClear( pTxStreamBuf );
                    xTaskResumeAll();
                } else {
                    vTaskDelay(1);
                }

                pTxBuf = FreeRTOS_get_tx_head(xConnectSocket, &xTxBufLen);
            }

            int32_t respLen = sPrvModbusTCPDoResponse(pRxBuf, rxLen, pTxBuf, configMODBUS_TCP_MAX_ADU_LEN);
            if(respLen > 0) {
                FreeRTOS_send(xConnectSocket, NULL, respLen, 0);
            }                
        } else if (rxLen < 0) {
            break;
        }
    }

Hello @michaelyoyo,
Thank you for providing us with that code snippet.

That vTaskDelay(1) might have something to do with it :-)
Imagine the TCP task finishing sending the last byte just as you call vTaskDelay(1). In this case, the TCP task and your application task will both wait for some time: the TCP task has nothing to do, and your task is stuck in vTaskDelay().

If this is the ONLY task you plan on having in your code (and there are no watchdog timers which need to be kicked every so often), you can try removing that delay. The TCP task has a higher priority, will run as long as it wants, and will preempt your task once it is ready; you can busy-loop until then. But do keep in mind that if you have other tasks, removing the delay will starve them, which would be very bad. I am suggesting this only for experimentation, because the zero-copy method should be faster than the full-copy method.

Thanks,
Aniruddha

Thank you for the snippet of code. It looks perfect to me, except, as @kanherea wrote, for using vTaskDelay().

I would recommend using a semaphore instead, which is triggered by all socket events.

Here is a short “symbolic” example (not tested):

/* Remember to add this to your FreeRTOSIPConfig.h : */

#define ipconfigSOCKET_HAS_USER_SEMAPHORE    1

static SemaphoreHandle_t xSemaphore;

static void serverTask( Socket_t xConnectSocket )
{
    const TickType_t xReceiveTimeOut = pdMS_TO_TICKS( 2 );
    StreamBuffer_t * pTxStreamBuf;
    uint8_t * pRxBuf;
    uint8_t * pTxBuf;
    BaseType_t xTxBufLen;

    /* xSemaphore must be created with `xSemaphoreCreateBinary()`. */

    /* Connect 'xSemaphore' with 'xConnectSocket'. */
    FreeRTOS_setsockopt( xConnectSocket,
                         0,
                         FREERTOS_SO_SET_SEMAPHORE,
                         ( void * ) &xSemaphore,
                         sizeof( xSemaphore ) );

    /* Force creation of the TX stream buffer. */
    configASSERT( FreeRTOS_get_tx_base( xConnectSocket ) );

    pTxStreamBuf = xConnectSocket->u.xTCP.txStream;

    for( ; ; )
    {
        BaseType_t rxLen = FreeRTOS_recv( xConnectSocket,
                                          ( void * ) &pRxBuf,
                                          configMODBUS_TCP_MAX_ADU_LEN,
                                          FREERTOS_ZERO_COPY );

        if( rxLen > 0 )
        {
            /* Advance the RX stream past the data just inspected. */
            FreeRTOS_recv( xConnectSocket, ( void * ) NULL, rxLen, 0 );

            pTxBuf = FreeRTOS_get_tx_head( xConnectSocket, &xTxBufLen );

            if( pTxBuf == NULL )
            {
                continue;
            }

            while( xTxBufLen < configMODBUS_TCP_MAX_ADU_LEN )
            {
                if( ( pTxStreamBuf->uxTail != 0U ) &&
                    ( pTxStreamBuf->uxTail == pTxStreamBuf->uxMid ) &&
                    ( pTxStreamBuf->uxMid == pTxStreamBuf->uxHead ) )
                {
                    /* All data has been sent and acknowledged: reset the
                     * stream buffer to reclaim the full TX space. */
                    vTaskSuspendAll();
                    vStreamBufferClear( pTxStreamBuf );
                    xTaskResumeAll();
                }
                else
                {
                    /* Wait for a socket event instead of polling. */
                    xSemaphoreTake( xSemaphore, xReceiveTimeOut );
                }

                pTxBuf = FreeRTOS_get_tx_head( xConnectSocket, &xTxBufLen );
            }

            int32_t respLen = sPrvModbusTCPDoResponse( pRxBuf,
                                                       rxLen,
                                                       pTxBuf,
                                                       configMODBUS_TCP_MAX_ADU_LEN );

            if( respLen > 0 )
            {
                /* The data is already in the stream buffer: pass NULL. */
                FreeRTOS_send( xConnectSocket, NULL, respLen, 0 );
            }
        }
        else if( rxLen < 0 )
        {
            break;
        }
    }
}

When a semaphore is connected to a socket, any event can trigger (“give to”) the semaphore.
In this case you want the task to wake up when an eSOCKET_SEND event has occurred, which probably means that there is space in the TX stream buffer.

Thanks for the suggestions.

The application I’m running has quite a lot of other processing going on. I ran the benchmarks again, with as much of the rest of the system disabled as I could. This time around, the rates were a bit more consistent and made more sense.

Type                 Rate (rqsts/sec)

Full Copy            730
Part ZC              750
Full ZC (delay)      760
Full ZC (semaphore)  760

I added a counter that I incremented every time either vTaskDelay() or xSemaphoreTake() was called, and found that in my tests those functions were never called. I think this is because the testing script on my computer works sequentially: it sends a Modbus request, waits for the response, and then sends the next request, so by the time recv() gets the next request it’s ensured that the previous response has been sent.

In any case, I wasn’t aware of how useful the user semaphore could be, I’ll use that here and keep it in mind for the future.


Thanks for testing again!

@michaelyoyo wrote:

so by the time recv() gets the next request it’s ensured that the previous response has been sent

Ah yes of course. Client and server are perfectly in sync.

The semaphore method can be very helpful if you want to handle multiple sockets, streams and more in a single task.

Recently, I wrote a summary of all asynchronous techniques in FreeRTOS+TCP.

When using these techniques, it is preferred to disable the socket’s time-outs.
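
For example, a zero time-out makes FreeRTOS_recv() and FreeRTOS_send() return immediately, so the task only ever blocks on the semaphore (a minimal sketch):

    TickType_t xNoTimeOut = 0;

    /* Non-blocking receive and send: the task waits on the user
     * semaphore instead of inside the socket API. */
    FreeRTOS_setsockopt( xSocket, 0, FREERTOS_SO_RCVTIMEO,
                         &xNoTimeOut, sizeof( xNoTimeOut ) );
    FreeRTOS_setsockopt( xSocket, 0, FREERTOS_SO_SNDTIMEO,
                         &xNoTimeOut, sizeof( xNoTimeOut ) );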