FreeRTOS TCP server recv() drops packets

Dear community,
I have a problem using the FreeRTOS TCP connection.

My application needs to send short packets of 13 bytes periodically (once per second) between a TCP client (C++ application on PC) and a TCP server which is implemented using the FreeRTOS TCP/IP stack.

This works fine for a random amount of time until suddenly one packet is apparently dropped by the server. I monitored the connection with Wireshark and saw that the packet in question is sent by the client and acknowledged by the FreeRTOS TCP stack, but the recv() function of the server won’t return any data, no matter how often I try and how long I wait.

If I send some more data, the server resumes operation with this new data as usual, but the “lost” data remains lost forever. Just as if it was never received.

I tried a similar test scenario but with fewer bytes (4 bytes) and no concurrently running application in the background, which worked without dropping any data. I left it running far longer than the other application can run before a packet is dropped.

The IP task priority is in both tests the highest priority among all tasks.
I am going to attach the FreeRTOSIPConfig.h so you can see all my configurations.
I enabled all debug switches but did not get any output when a packet was dropped.

The “network” is just one cable that goes from the PC to the device running FreeRTOS (Zynq FPGA), so there is no traffic at all except the one generated by the application itself.

Apparently the application has something to do with the issue… Can somebody give me a hint what I could check to resolve the problem? Has somebody experienced something similar?

Thanks in advance!

FreeRTOSIPConfig.h (20.8 KB)

Hello @Stonebull,

Welcome back to the FreeRTOS forums.

This problem seems interesting. Would you mind attaching the pcap file that Wireshark captured? Also, would you mind sharing the code that you are using (if it is not proprietary)?

Thanks,
Aniruddha

@kanherea Thanks for your reply! I am glad to attach the Wireshark trace of the transactions where the error happened.

ErrorTrace.zip (804 Bytes)

Explanation of attached traces:



Unfortunately I cannot provide the source code easily. I am working on reproducing the error in a test project though. As soon as I succeed in reproducing the error I will attach it.

I hope that the trace helps you to better understand what’s going on.
If you have further questions please let me know, I will try to give as much information as possible without compromising our NDA.

Best Regards,
Emanuel

Hi Emanuel @Stonebull,

thanks for reporting this.

You wrote:

The IP task priority is in both tests the highest priority among all tasks.

We would recommend this priority scheme:

  • Highest: EMAC task from NetworkInterface.c
  • Medium: IP-task
  • Lower: Tasks that make use of the IP-stack

Although I don’t think that this caused your problem.
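
As a rough sketch of that priority scheme, the ordering could be written down as below. ipconfigIP_TASK_PRIORITY is the usual setting in FreeRTOSIPConfig.h; niEMAC_HANDLER_TASK_PRIORITY is, as far as I recall, the name used in the Zynq NetworkInterface.c, so please check it against your driver. The value of configMAX_PRIORITIES and the name appSERVER_TASK_PRIORITY are assumptions made up for this example.

```c
/* Hypothetical value; in a real project this comes from FreeRTOSConfig.h. */
#define configMAX_PRIORITIES            7

/* Highest: the EMAC handler task of the network interface
 * (macro name assumed, verify it in your NetworkInterface.c). */
#define niEMAC_HANDLER_TASK_PRIORITY    ( configMAX_PRIORITIES - 1 )

/* Medium: the IP-task, normally configured in FreeRTOSIPConfig.h. */
#define ipconfigIP_TASK_PRIORITY        ( configMAX_PRIORITIES - 2 )

/* Lower: tasks that make use of the IP-stack.
 * This name is invented for the sketch. */
#define appSERVER_TASK_PRIORITY         ( configMAX_PRIORITIES - 3 )

/* Compile-time checks (C11) that the ordering is kept. */
_Static_assert( niEMAC_HANDLER_TASK_PRIORITY > ipconfigIP_TASK_PRIORITY,
                "EMAC task must outrank the IP-task" );
_Static_assert( ipconfigIP_TASK_PRIORITY > appSERVER_TASK_PRIORITY,
                "IP-task must outrank the tasks that use the stack" );
```

The static assertions are optional; they only make a wrong ordering fail at build time instead of showing up as odd runtime behaviour.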

Software version: are you using the latest sources from the Github repo?
And if not, could you try it with the latest release?

I tried a similar test scenario but with fewer bytes (4 bytes) and no concurrently running application in the background, which worked without dropping any data

What do you mean by “no concurrently running application”, is that on the Zynq, or on the host (client side)?

The “network” is just one cable that goes from the PC to the device running FreeRTOS

I always recommend using a switch between the PC and the Zynq. Remember that when you connect the DUT (Zynq), the PHYs on both sides of the cable start negotiating.
Fortunately you have disabled ipconfigUSE_DHCP, because that would cause more conflicts (both devices might take the 0.0.0.0 address).
It looks like your PC has a fixed IP-address and does not use DHCP?
( using a switch is not a must, it is just more convenient while testing ).

Apparently the application has something to do with the issue…

Do you refer to the same application here? What application was it?

Thank you for the copy of FreeRTOSIPConfig.h, that is always very useful.

/* Set to 1 if the driver's transmit function is
* using zero copy. Otherwise set to 0. */
#define ipconfigZERO_COPY_TX_DRIVER 0

Can you set both these macros to 1:

#define ipconfigZERO_COPY_TX_DRIVER			1
#define ipconfigZERO_COPY_RX_DRIVER			1

Or did you disable ipconfigZERO_COPY_TX_DRIVER on purpose?
The driver supports zero-copy behaviour in both directions and that is the preferred method.

#define ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS 96

That is a lot of network buffers. It won’t cause problems, except that RAM memory is occupied by them.

Finally, I am most curious about “the other application”, how that influences the observations? Maybe you can start two or more applications and see if the problem gets worse?

Hi Hein @htibosch thanks for your extended reply.
I tried all your suggestions, unfortunately it did not solve my problem.

Thanks for pointing that out, I happened to use the same priority for the IP task and for the EMAC task. However, fixing that did not fix the issue of losing packets.

No I didn’t do it on purpose. I am not sure how I ended up with this configuration. Either I tried various things the first time I started using the FreeRTOS TCP/IP stack and left it like that as I got it to work, or I used some example application as a template that had it set-up like this.
Anyway I changed both to ‘1’ but it did not make any difference.

Yes you are probably right it is too much, but we have plenty of RAM and it has not hurt me up to now.
How many network buffers would you recommend, keeping in mind that the biggest message size expected for RX or TX is around 500 bytes (very seldom) and most of the time around 30 bytes (approx. 4 - 8 times per second)?

Yes, you are right, neither the client side (PC) nor the server side (Zynq) uses DHCP. For simplicity, I configured both to use fixed IPs so I did not need a switch. I connected the Zynq through a separate network interface so it’s totally disconnected from any router and private network that may use DHCP or has any participant with a conflicting IP address.

Sorry for not expressing it clearly enough, by “concurrently running application” I refer to several FreeRTOS tasks running concurrently on the same CPU as the TCP/IP stack, so on the Zynq (server side).

Yes, I am referring to the FreeRTOS tasks that run in the background on the Zynq. It’s a total of 9 tasks with varying priorities (although all lower than the IP task’s priority & the EMAC task’s priority). Most of them are sleeping, waiting for an event to happen.

I am not sure what you mean by “start more applications”. I guess you mean to run more or fewer FreeRTOS tasks in the background to see if the increased or decreased load on the system changes something.
If that’s what you mean, I’ve tried something like that already. That’s what I was referring to as “Test Project” before. It is just a single task running on the Zynq and implementing the server code. This time the communication was always stable. Therefore I assumed that the “main” application has something to do with the issue of losing packets, although I don’t know how yet.

No, I was using FreeRTOS+TCP V2.0.11. I first tried to apply your other suggestions, then tried to upgrade to the most recent stack version but still get the same issues.

Do you have any more ideas what I could try? If you need some more information please feel free to ask.

Here is a short update: I investigated the dropped-packets problem with @Stonebull. At first sight, his TCP code looked OK to me.

But then I proposed to use a simple telnet module, and use this for responding to TCP clients.

He had it run for many hours and told me that no packets were dropped.
Emanuel is off for one week, so next week we will try to find out why his code would miss TCP packets.

Hi Emanuel,

just to rule out the common fallacies:

Can we see your application code? In particular - could it be that you do not check the return value of your recv() call and assume full packet reception whenever the return code is greater than 0, even if you asked for, say, 20 characters and received only 10? That is a VERY common fallacy.

Hi @RAc, thanks for your interest in the topic.

Unfortunately I cannot publish the whole application code. I will however upload the networking part along with an example application to act as a basis for discussion.

Good point, but I do check the return value of recv(). If the value is negative an error is returned straight away. If the value is positive but smaller than the expected value and there is still time left to wait, recv() is called again for the missing bytes.
In those occasions where recv() drops data it returns just 0. Which is interpreted as “no new data arrived”.
Below you can see in short the relevant part of the code:

// Find out what event caused the select to return.
actualRecBytes = FreeRTOS_recv(pCtrl->socketClient,
							pTempDest,
							numBytes,
							FLAGS_NOT_IMPLEMENTED );
...
if(actualRecBytes == -pdFREERTOS_ERRNO_ENOTCONN 	//If the socket was closed or got disconnected
			||actualRecBytes == -pdFREERTOS_ERRNO_ENOMEM 	//If there was not enough memory for the socket to be able to create either an Rx or Tx stream
			||actualRecBytes == -pdFREERTOS_ERRNO_EINTR		//If the socket received a signal, causing the read operation to be aborted
			||(flag & eSELECT_EXCEPT))						//If the select function captioned that the socket got disconnected
{
		...

		retVal = RET_ERR_UPORT_IPSTACK_SOCKET_CLOSED;
		break;
}else if(actualRecBytes == 0)
{
		//A timeout of the socket occurred - return a timeout error
		retVal = RET_ERR_UPORT_TIMEOUT;
		break;
} else if(actualRecBytes <= numBytes)
{
		//Some data could be retrieved
		retVal = RET_OK;

		//Decrease number of numBytes by the number of received bytes & sum up to the number of received bytes
		numBytes -= actualRecBytes;
		*pBytesRead += actualRecBytes;

		//Set temporary pointer to new destination in order to keep the already received data from being overwritten
		pTempDest += actualRecBytes;

		//Reset Actual Byte Count
		actualRecBytes = 0;
}

If you wish to see the whole code please have a look at the attached files. I want to avoid spamming the chat by putting it all in here.


Looking further into the problem I can report some findings:
I registered a FREERTOS_SO_WAKEUP_CALLBACK in which, on @htibosch’s advice, I peek at the recv() data using the option FREERTOS_MSG_PEEK to see whether there is unread data. By doing so I could see that the data bytes that are occasionally dropped by recv() are actually inside the receive buffer at one point in time, as they can be peeked successfully. They just disappear later on when I try to actually read them.

I suspect that somewhere after the callback, the pxBuffer->uxTail is moved forward making the data packet impossible to retrieve for recv() when calling it from UportRead().
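
To make the difference between peeking and reading explicit, here is a minimal ring-buffer model. It is not the FreeRTOS+TCP stream buffer; the type Ring_t and the functions ringSize(), ringAdd() and ringGet() are invented for this sketch. It only illustrates that a peek copies the data without moving the tail, while a normal read moves the tail so the same bytes cannot be fetched a second time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_LENGTH    16U

/* Simplified model of a stream buffer (names are hypothetical). */
typedef struct
{
    size_t uxHead;                  /* next position to write */
    size_t uxTail;                  /* next position to read */
    uint8_t ucArray[ RING_LENGTH ];
} Ring_t;

/* Number of unread bytes between tail and head. */
size_t ringSize( const Ring_t * pxRing )
{
    assert( pxRing != NULL );
    return ( pxRing->uxHead + RING_LENGTH - pxRing->uxTail ) % RING_LENGTH;
}

/* Store up to uxCount bytes; one slot always stays free. */
size_t ringAdd( Ring_t * pxRing, const uint8_t * pucData, size_t uxCount )
{
    size_t uxFree = RING_LENGTH - 1U - ringSize( pxRing );
    size_t uxIndex;

    if( uxCount > uxFree )
    {
        uxCount = uxFree;
    }

    for( uxIndex = 0U; uxIndex < uxCount; uxIndex++ )
    {
        pxRing->ucArray[ pxRing->uxHead ] = pucData[ uxIndex ];
        pxRing->uxHead = ( pxRing->uxHead + 1U ) % RING_LENGTH;
    }

    return uxCount;
}

/* Copy up to uxMaxCount bytes; the tail only moves when xPeek is 0. */
size_t ringGet( Ring_t * pxRing, uint8_t * pucData, size_t uxMaxCount, int xPeek )
{
    size_t uxCount = ringSize( pxRing );
    size_t uxPos = pxRing->uxTail;
    size_t uxIndex;

    if( uxCount > uxMaxCount )
    {
        uxCount = uxMaxCount;
    }

    for( uxIndex = 0U; uxIndex < uxCount; uxIndex++ )
    {
        pucData[ uxIndex ] = pxRing->ucArray[ uxPos ];
        uxPos = ( uxPos + 1U ) % RING_LENGTH;
    }

    if( xPeek == 0 )
    {
        /* A real read: the bytes are removed from the buffer. */
        pxRing->uxTail = uxPos;
    }

    return uxCount;
}
```

In this model, a 13-byte packet that was peeked is still reported by ringSize() afterwards, whereas after one non-peek ringGet() a second call returns 0. That matches the symptom: whoever performs the real read first owns the data.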

In order to track down if and when the tail is moved “unintentionally”, I placed a watchpoint on the address of pCtrl->socketClient->u.xTCP.rxStream->uxTail that should be triggered as soon as the tail is moved. I also added a watchpoint condition that prevents it from triggering when the tail is moved by the recv() function “intentionally”.

Unfortunately, as soon as I place the watchpoint the issue of losing a packet just does not happen anymore. I cannot say for sure that the placement of the watchpoint is the reason for that, but I did not change anything else and the problem is just not showing up, no matter how many packets I send.
(I’ve currently tried with around 18.000 packets ~ 5h runtime without a single fail. Whereas without the watchpoint it tends to fail between 1 - 5 minutes.)

This is of course unacceptable as a solution. It just prevents me from troubleshooting further with watchpoints.

@RAc what is your impression so far?

Edit: Forgot to add the project
ConceptProject.zip (18.9 KB)

Best Regards,
Emanuel

Hello!
I could somewhat track down where the problem happens and why adding a watchpoint fixes it.

I noticed that when I added the watchpoint mentioned in the last post (the one that monitors write accesses to pxBuffer->uxTail), the latency of the pings that I keep sending once a second from the client (PC) to the server (Zynq APU with FreeRTOS stack) increases.
[Image: PingLatency]

I assume that this latency is caused by the processor being halted shortly whenever the ring buffer variable pxBuffer->uxTail is changed. Then the debugger checks the conditions and silently resumes operation as the break conditions that I defined are not met. The user does not notice that the processor was halted for a short amount of time - the only visible result is the increased ping latency.

So as mentioned, the watchpoint looks out for write operations to the ring buffer variable pxBuffer->uxTail. This happens in the function uxStreamBufferGet() defined in FreeRTOS_Stream_Buffer.c. Therefore it can be assumed that the program is briefly halted whenever the processor passes over this line of code. This means that adding this watchpoint essentially adds a wait statement there.

To test out my assumptions I removed the watchpoint and added vTaskDelay(pdMS_TO_MIN_TICKS(100)); instead.

And so far I am not getting any data loss problems anymore!

I added the function with the changes I made for reference here:

/**
 * @brief Read bytes from stream buffer.
 *
 * @param[in] pxBuffer: The buffer from which the bytes will be read.
 * @param[in] uxOffset: can be used to read data located at a certain offset from 'lTail'.
 * @param[in,out] pucData: If 'pucData' equals NULL, the function is called to advance 'lTail' only.
 * @param[in] uxMaxCount: The number of bytes to read.
 * @param[in] xPeek: if 'xPeek' is pdTRUE, or if 'uxOffset' is non-zero, the 'lTail' pointer will
 *                   not be advanced.
 *
 * @return The count of the bytes read.
 */
size_t uxStreamBufferGet( StreamBuffer_t * pxBuffer,
                          size_t uxOffset,
                          uint8_t * pucData,
                          size_t uxMaxCount,
                          BaseType_t xPeek )
{
    size_t uxSize, uxCount, uxFirst, uxNextTail;

    /* How much data is available? */
    uxSize = uxStreamBufferGetSize( pxBuffer );

    if( uxSize > uxOffset )
    {
        uxSize -= uxOffset;
    }
    else
    {
        uxSize = 0U;
    }

    /* Use the minimum of the wanted bytes and the available bytes. */
    uxCount = FreeRTOS_min_size_t( uxSize, uxMaxCount );

    if( uxCount > 0U )
    {
        uxNextTail = pxBuffer->uxTail;

        if( uxOffset != 0U )
        {
            uxNextTail += uxOffset;

            if( uxNextTail >= pxBuffer->LENGTH )
            {
                uxNextTail -= pxBuffer->LENGTH;
            }
        }

        if( pucData != NULL )
        {
            /* Calculate the number of bytes that can be read - which may be
             * less than the number wanted if the data wraps around to the start of
             * the buffer. */
            uxFirst = FreeRTOS_min_size_t( pxBuffer->LENGTH - uxNextTail, uxCount );

            /* Obtain the number of bytes it is possible to obtain in the first
             * read. */
            ( void ) memcpy( pucData, &( pxBuffer->ucArray[ uxNextTail ] ), uxFirst );

            /* If the total number of wanted bytes is greater than the number
             * that could be read in the first read... */
            if( uxCount > uxFirst )
            {
                /*...then read the remaining bytes from the start of the buffer. */
                ( void ) memcpy( &( pucData[ uxFirst ] ), pxBuffer->ucArray, uxCount - uxFirst );
            }
        }

        if( ( xPeek == pdFALSE ) && ( uxOffset == 0U ) )
        {
            /* Move the tail pointer to effectively remove the data read from
             * the buffer. */
            uxNextTail += uxCount;

            if( uxNextTail >= pxBuffer->LENGTH )
            {
                uxNextTail -= pxBuffer->LENGTH;
            }

            //DEBUG ES
            vTaskDelay(pdMS_TO_MIN_TICKS(100));
            //DEBUG END

            pxBuffer->uxTail = uxNextTail;

        }
    }

    return uxCount;
}

If adding this delay really solves the problem I assume that some kind of race condition happens inside the FreeRTOS stack.

Can anybody confirm that all this makes sense and maybe even suggest a solution?
Adding a delay can only be a short term workaround as it certainly is not a reliable solution.

Are you using uncached memory for your buffers?

@rtel I am not exactly sure which buffers you are referring to.
If you mean memory space in RAM I would say no, they should be cached as the Zynq accesses its DDR memory via the L2 cache controller.

@rtel wrote:

Are you using uncached memory for your buffers?

I think that Richard is referring to the type of memory that is used for the DMA buffers, please see uncached_memory.c. It makes sure that all DMA descriptors are stored in uncached memory.

If you use the driver in “portable/NetworkInterface/Zynq”, then the buffers are declared static in the function vNetworkInterfaceAllocateRAMToBuffers, and thus they are cached.
The driver knows that and will call the necessary DCache functions.

Reading your source code, I would recommend simplifying it a lot.

@htibosch, @rtel

Ah I see, thanks for clarification, it looks like the buffers are allocated uncached, or am I mistaken?

XStatus init_dma( xemacpsif_s * xemacpsif )
{
    NetworkBufferDescriptor_t * pxBuffer;

    int iIndex;
    UBaseType_t xRxSize;
    UBaseType_t xTxSize;
    struct xtopology_t * xtopologyp = &xXTopology;

    xRxSize = ipconfigNIC_N_RX_DESC * sizeof( xemacpsif->rxSegments[ 0 ] );

    xTxSize = ipconfigNIC_N_TX_DESC * sizeof( xemacpsif->txSegments[ 0 ] );

    xemacpsif->uTxUnitSize = dmaRX_TX_BUFFER_SIZE;

    /*
     * We allocate 65536 bytes for RX BDs which can accommodate a
     * maximum of 8192 BDs which is much more than any application
     * will ever need.
     */
    xemacpsif->rxSegments = ( struct xBD_TYPE * ) ( pucGetUncachedMemory( xRxSize ) );
    xemacpsif->txSegments = ( struct xBD_TYPE * ) ( pucGetUncachedMemory( xTxSize ) );
    xemacpsif->tx_space = ( unsigned char * ) ( pucGetUncachedMemory( ipconfigNIC_N_TX_DESC * xemacpsif->uTxUnitSize ) );
(...)

The function vNetworkInterfaceAllocateRAMToBuffers() is not called anywhere in my project as I am using “BufferAllocation_2.c”. So I guess that the memory of the DMA buffers is indeed uncached. Now is this a good thing or a bad thing? Sorry for my limited knowledge, but how can this affect the behaviour? Does it have something to do with the cache not being up-to-date?

Can you give me some hints on what you would improve?


Do you have a clue why adding a delay inside recv() fixed the problem for now?

Thanks a lot,
Emanuel

Normally, when there is plenty of space in RAM, BufferAllocation_1 is used.
The _2 version uses malloc to allocate the exact amount of bytes.
For the Zynq, in both schemes, the memory is allocated in SDRAM, which is accessed through the L2 cache.
The driver is aware of that.
Yet, why not try the _1 allocation in your project?
If I were you, I would strip the project: remove the mutex, the queue and also the socket set.
You can take the telnet driver as an example of how to create the read/write functions.
When that works 100%, you can add other features and see if and where it goes wrong.

I can of course try to use BufferAllocation_1. I just chose BufferAllocation_2 because the readme.txt of the FreeRTOS-Plus-TCP library recommends it.

At this time it is recommended to use BufferAllocation_2.c in which case it is
essential to use the heap_4.c memory allocation scheme:
FreeRTOS - Memory management options for the FreeRTOS small footprint, professional grade, real time kernel (scheduler)

Thanks for your advice, I guess I’ll have no choice but to redo the whole TCP/IP port that I wrote previously in order to hopefully spot the problem…

Hello,
I have a short question regarding the behaviour of recv() in FreeRTOS.

The documentation states:

If the receive operation cannot complete immediately because there is no data queued on the socket to receive then the calling RTOS task will be held in the Blocked state (so that other tasks can execute) until either data has been received, or the timeout expires.

Does this mean that the recv() function blocks until either all requested data is present or the timeout occurs, whichever comes first? Or does the recv() function return as soon as some data is present, even if the timeout has not been exceeded yet?

Assuming that the requested data does not always arrive in one go, is it possible that recv() returns with less data than requested while there is still time left to wait?

Thanks for your help.

Best Regards,
Emanuel

Good question: recv() returns immediately as soon as at least one byte has been received.

On the contrary, send() will try to send all bytes and it will also block when necessary.

Now if the received packet is 32 bytes long (or <= MSS), it is very unlikely but possible that a packet is transmitted in 2 or more parts.

The simplest code would be like this:


BaseType_t receivePacket( char * pcBuffer, BaseType_t xExpected )
{
    BaseType_t xReturn = 0;
    BaseType_t xResult = 0;
    do
    {
        xResult = FreeRTOS_recv( xSocket, ( void * )pcBuffer, xExpected, 0 );
        if( xResult <= 0 )
        {
            break;
        }
        pcBuffer += xResult;
        xReturn += xResult;
        /* FreeRTOS_recv() is not allowed to return more bytes than we asked for. */
        configASSERT( xResult <= xExpected );
        xExpected -= xResult;
    }
    while( xExpected > 0 );

    if( ( xResult < 0 ) &&
        ( xResult != -pdFREERTOS_ERRNO_EAGAIN )  &&
        ( xResult != -pdFREERTOS_ERRNO_EINTR ) )
    {
        /* This is a fatal error or a disconnection. */
        xReturn = -1;
    }
    else
    {
        /* The number of bytes received will be returned. */
    }
    return xReturn;
}

Thanks for clarification!

I thought so earlier and implemented a similar mechanism… While re-doing my code and trying to simplify it as far as possible to track down the problem of missing packets, I suddenly had doubts on that, but seems like I’ll need it.

Best Regards,
Emanuel

I think I just solved the packet loss issue.

It had nothing to do with FreeRTOS or the FreeRTOS network stack. I messed up in the UportRead() API. :man_facepalming:

In case you want to know the details, I have written a short explanation below.

The thing that screwed me was that I still checked for timeout expiry after receiving data with recv(). In case the timeout was exceeded, the API dropped all of the received data and returned a timeout error.
When I wrote the lib I reasoned that, if the timeout gets exceeded, the data should be dropped as it did not arrive “in time” anyway. But I did not keep in mind that this data is lost by doing so, as it has already been read from the network buffer in the first place. :man_facepalming:
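
For anyone who runs into the same thing, here is a small model of the mistake. It is not the real UportRead() and it does not use FreeRTOS; FakeSocket_t, fakeRecv(), RECV_COST_TICKS and both read functions are made up for this sketch. The only point is the order of "consume data" and "check deadline".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define READ_TIMEOUT       ( -1L )
#define RECV_COST_TICKS    5U    /* simulated duration of one receive call */

/* Hypothetical stand-in for a socket plus a tick counter. */
typedef struct
{
    const uint8_t * pucPending;  /* bytes waiting in the "socket" */
    size_t uxPending;            /* how many bytes are waiting */
    uint32_t ulNow;              /* simulated tick count */
} FakeSocket_t;

/* Like recv(): hands out pending bytes and consumes them for good. */
long fakeRecv( FakeSocket_t * pxSocket, uint8_t * pucDest, size_t uxMax )
{
    size_t uxCount = ( pxSocket->uxPending < uxMax ) ? pxSocket->uxPending : uxMax;

    assert( pucDest != NULL );
    memcpy( pucDest, pxSocket->pucPending, uxCount );
    pxSocket->pucPending += uxCount;
    pxSocket->uxPending -= uxCount;
    pxSocket->ulNow += RECV_COST_TICKS;

    return ( long ) uxCount;
}

/* The mistake: the deadline is checked after the data was consumed. */
long readDroppingLateData( FakeSocket_t * pxSocket, uint8_t * pucDest,
                           size_t uxMax, uint32_t ulDeadline )
{
    long lCount = fakeRecv( pxSocket, pucDest, uxMax );

    if( pxSocket->ulNow > ulDeadline )
    {
        /* The bytes in pucDest are discarded and can never be read again. */
        return READ_TIMEOUT;
    }

    return ( lCount > 0 ) ? lCount : READ_TIMEOUT;
}

/* The fix: the deadline only decides whether to keep waiting. */
long readKeepingData( FakeSocket_t * pxSocket, uint8_t * pucDest,
                      size_t uxMax, uint32_t ulDeadline )
{
    long lCount;

    do
    {
        lCount = fakeRecv( pxSocket, pucDest, uxMax );
    } while( ( lCount == 0 ) && ( pxSocket->ulNow <= ulDeadline ) );

    /* Data that was read is always handed to the caller. */
    return ( lCount > 0 ) ? lCount : READ_TIMEOUT;
}
```

With 13 bytes pending and a deadline of 3 ticks, the first variant reports a timeout although the socket is now empty, while the second one returns 13. A timeout is only reported when nothing at all was read.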

Anyway, now at least the issue is solved and nobody is to blame but me.

Thanks for your time and help!

Best Regards,
Emanuel

That is good news, Emanuel.

Sometimes I also follow the procedure that I recommended to you: eliminating parts of the code until the problem disappears. It is a tiring but sure way to a solution.

Thanks for your time and help!

You are welcome, and thanks for reporting back.
