When using the TX zero copy interface, is there a way to force the TX stream buffer to reset or wrap around before the data reaches the end of the buffer?
I’m trying to modify a Modbus TCP application to use the zero-copy interface. In Modbus, the TX length is variable, anywhere from about 9 bytes up to 260 bytes. Before starting response generation, I check that the buffer I’ve been given for the response is at least 260 bytes long. This works initially, but after a few dozen responses the TX stream buffer’s write pointer gets within 260 bytes of the buffer end and my application errors out.
In the copy-based method, the TCP stack can split the data and handle the stream buffer wrapping back around, but my application can’t do that. I think it would be easiest for my application if I could reset the stream buffer back to the beginning instead of having to split the response into two parts.
I realize this might take some coordination with the task that’s reading data out of the stream buffer, but is this already possible with the existing stack? Or is this perhaps a common enough issue that it’s worth me trying to add it?
Looking at how the FTP server example handles this:

- If the TX space is 512 bytes or more, it reads from disk directly into the TX stream buffer.
- Otherwise, it uses a temporary buffer pcFILE_BUFFER[], whose size is a multiple of 512 bytes, and uses the copy method.
I found that FTP uses zero-copy transfers most of the time. Only exceptionally, it will use the temporary buffer pcFILE_BUFFER[].
If that answers your question, we’re done now.
If you insist on using “TX Zero Copy” at all times, I’d think of the following:

- The IP task is the reader of the TX stream buffer.
- Your application is the writer of the TX stream buffer.
Suppose that your application sees that it cannot send an entire packet because less than 260 bytes of space remain, and suppose that all outstanding bytes have already been sent. At that point the stream buffer is empty and could in principle be rewound to the start.
I played around with both methods, and I think the full time zero copy fits my goals best.
To test the interfaces, I timed a sequence of 10,000 Modbus queries from Python, running the test once per TX interface: full copy, part-time zero copy, and full-time zero copy. I also checked the latency in Wireshark, measured as the time between the device ACKing the query packet and sending the response packet. Each response was 257 bytes long. Zero copy was used for receive in all tests.
| Type | Rate (rqsts/sec) | Latency (ms) |
|---|---|---|
| Full Copy | 600 | 1.0 |
| Part ZC | 550 | 1.5 |
| Full ZC | 550 | 1.5 |
The test script didn’t report lost or invalid responses with any of the methods.
The goal of moving to zero copy was to (a) learn a bit more about how it works, and (b) remove the need for allocating extra buffer space. I don’t like the performance hit that comes with it, but since it lets me completely remove the additional buffer, I’ll go with full-time zero copy. I appreciate the help!
Hi @michaelyoyo ,
Thanks for sharing the results.
With zero copy the application has to populate the stream buffer directly, so the stream buffer size and the number of network buffers both affect CPU load and performance. For example, a smaller buffer means the application has to refill it more often, which costs the TCP task more CPU cycles. It’s always worth experimenting a bit to arrive at the optimal combination of stream buffer size and number of buffers.
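For example, in recent FreeRTOS+TCP versions the default stream buffer sizes can be tuned in FreeRTOSIPConfig.h (the values below are only illustrative, not recommendations):

```c
/* FreeRTOSIPConfig.h - illustrative sizing, to be tuned per application. */
#define ipconfigTCP_MSS                1460                     /* max segment size */
#define ipconfigTCP_TX_BUFFER_LENGTH   ( 4 * ipconfigTCP_MSS )  /* TX stream buffer */
#define ipconfigTCP_RX_BUFFER_LENGTH   ( 4 * ipconfigTCP_MSS )  /* RX stream buffer */
```

The sizes can also be overridden per socket with FreeRTOS_setsockopt() and FREERTOS_SO_WIN_PROPERTIES before the connection is made.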
Thanks for reporting back, but I’m not sure if I understand the results in your table:
| Type | Rate (rqsts/sec) | Latency (ms) |
|---|---|---|
| Full Copy | 600 | 1.0 |
| Part ZC | 550 | 1.5 |
| Full ZC | 550 | 1.5 |
Does Full Copy mean the normal way of sending data, where FreeRTOS_send() calls memcpy()?
Does the table say that (partial or full) zero-copy is slower than the old copy-method?
Would it be possible to show the code that calls FreeRTOS_send()?
That’s correct, the zero-copy method was slower than the standard method. I didn’t look into it too much to figure out why, but here’s the loop as it was in the ‘full ZC’ test.
```c
while(1) {
    uint8_t *pRxBuf = NULL;
    /* Zero-copy receive: the stack sets pRxBuf to point into the RX stream buffer. */
    BaseType_t rxLen = FreeRTOS_recv(xConnectSocket, (void*)&pRxBuf, configMODBUS_TCP_MAX_ADU_LEN, FREERTOS_ZERO_COPY);
    if (rxLen > 0) {
        /* Release the received bytes back to the stack. */
        FreeRTOS_recv(xConnectSocket, (void*)NULL, rxLen, 0);

        BaseType_t xTxBufLen = 0;
        uint8_t *pTxBuf = FreeRTOS_get_tx_head(xConnectSocket, &xTxBufLen);
        if (pTxBuf == NULL) {
            continue;
        }
        // pTxBuf points into a ring buffer; when we get close to the end of it,
        // we may not have enough contiguous space to fit a complete response.
        // Since modbus isn't able to break up the response, we wait until
        // all data is sent (the buffer tail catches up to its head)
        // and then reset the buffer to the beginning.
        while (xTxBufLen < configMODBUS_TCP_MAX_ADU_LEN) {
            StreamBuffer_t* pTxStreamBuf = xConnectSocket->u.xTCP.txStream;
            if ((pTxStreamBuf->uxTail == pTxStreamBuf->uxMid)
                && (pTxStreamBuf->uxMid == pTxStreamBuf->uxHead)) {
                vTaskSuspendAll();
                vStreamBufferClear(pTxStreamBuf);
                (void)xTaskResumeAll();
            } else {
                vTaskDelay(1);
            }
            pTxBuf = FreeRTOS_get_tx_head(xConnectSocket, &xTxBufLen);
        }
        int32_t respLen = sPrvModbusTCPDoResponse(pRxBuf, rxLen, pTxBuf, configMODBUS_TCP_MAX_ADU_LEN);
        if (respLen > 0) {
            /* Data is already in place; passing NULL just advances the head by respLen bytes. */
            FreeRTOS_send(xConnectSocket, NULL, respLen, 0);
        }
    } else if (rxLen < 0) {
        break;
    }
}
```
Hello @michaelyoyo,
Thank you for providing us with that code snippet.
That vTaskDelay(1) might have something to do with it.
Imagine the TCP task finishing sending the last byte just as you call vTaskDelay(1). In that case, both the TCP task and your application task will wait for some time: the TCP task has nothing to do, and your task is stuck in vTaskDelay().
If this is the ONLY task you plan on having in your code (and there are no watchdog timers that need to be kicked every so often), you can try removing that delay. The TCP task has a higher priority, will run as long as it wants, and will preempt your task as soon as it becomes ready, so you can busy-loop until then. But do keep in mind that if you have other tasks, removing the delay will starve them, which would be very bad. I am suggesting this only for experimentation, because the zero-copy method should be faster than the full-copy method.
When a semaphore is attached to a socket, any socket event will trigger (“give to”) the semaphore.
In this case you want the task to wake up when an eSOCKET_SEND event has occurred, which probably means that there is space in the TX stream buffer.
The application I’m running has quite a lot of other processing going on. I ran the benchmarks again with as much of the rest of the system disabled as I could. This time around, the rates were a bit more consistent and made more sense.
| Type | Rate (rqsts/sec) |
|---|---|
| Full Copy | 730 |
| Part ZC | 750 |
| Full ZC (delay) | 760 |
| Full ZC (semaphore) | 760 |
I added a counter incremented every time either vTaskDelay() or xSemaphoreTake() was called, and found that in my tests those functions were never called. I think this is because the testing script on my computer works sequentially: it sends a Modbus request, waits for the response, and then sends the next request, so by the time recv() gets the next request, the previous response is guaranteed to have been sent.
In any case, I wasn’t aware of how useful the user semaphore could be, I’ll use that here and keep it in mind for the future.