TCP stack: task starvation

tselmeci wrote on Monday, August 22, 2016:

Hello all!

I’d like to report an issue I’ve found in the recent weeks.

Kinetis K64F (Cortex M4F), FreeRTOS-8.2.0, TCP-160112.

The K64F is the server. There’s a persistent connection between the K64F and the PC. The PC regularly sends requests of 100-200 bytes, and the K64F replies with packets smaller than 1500 bytes.

An HTTP server is also running and can serve 3 parallel requests. The PC executes many parallel wgets to fetch the homepage from the K64F at the highest pace possible.

When everything is up and running (1 persistent TCP connection + 3 continuously connecting, downloading and disconnecting HTTP clients) the machine gets rebooted by the HW watchdog after a while (usually within 1-5 minutes). The watchdog has a dedicated reset task, with a priority lower than the IP task’s priority.

I’ve figured out that the prvCalculateSleepTime(…) function sometimes calculates a sleep time of 0 ticks, so lower-priority tasks never get any CPU time. Even worse, it appears that once prvCalculateSleepTime(…) has calculated a sleep time of zero, it never recovers and keeps calculating this bad value until eternity (or until the watchdog resets ;)).

I’ve applied a simple fix to the end of the function:

    if (xMaximumSleepTime == 0)
        xMaximumSleepTime = pdMS_TO_TICKS(10);

In my case this does the trick: the system no longer gets starved.

I didn’t have time to dig into the code to find the real reason for this; the fix above has been running for more than 10 days without a stop, so I hope it can’t be that bad… :wink:

rtel wrote on Monday, August 22, 2016:

Hi Tamas,

Thanks for taking the time to report this - we can try and replicate the
situation.

I presume a calculated sleep time of 0 will be legitimate in some
circumstances, so that in itself does not sound like a problem (I would
have to check). There should not be a persistent time of 0 after
that, but I could not be sure without first being able to duplicate the
problem (and then step through the code).

heinbali01 wrote on Monday, August 22, 2016:

Hi Tamás,

There is a very recent post about the same subject.
This will be solved in the upcoming Labs release. Your patch will also work. I opted for a pdMS_TO_MIN_TICKS() macro that returns at least 1.
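A minimal sketch of what such a macro could look like. The definition of pdMS_TO_MIN_TICKS() below is an assumption made for illustration, not the code from the Labs release; TickType_t, configTICK_RATE_HZ and pdMS_TO_TICKS() are host-side stand-ins for the FreeRTOS definitions, and the two wrapper functions are hypothetical helpers used only to compare the results.

```c
#include <stdint.h>

/* Stand-ins so the sketch builds on a host PC. In a real project these
 * come from the FreeRTOS headers and FreeRTOSConfig.h. */
typedef uint32_t TickType_t;
#define configTICK_RATE_HZ 200u /* assumed tick rate, coarser than 1 ms */
#define pdMS_TO_TICKS(xTimeInMs) \
    ((TickType_t)(((TickType_t)(xTimeInMs) * (TickType_t)configTICK_RATE_HZ) / (TickType_t)1000))

/* Assumed definition: same conversion, but never less than one tick. */
#define pdMS_TO_MIN_TICKS(xTimeInMs) \
    ((pdMS_TO_TICKS(xTimeInMs) < (TickType_t)1) ? (TickType_t)1 : pdMS_TO_TICKS(xTimeInMs))

/* Hypothetical wrappers, only here to compare the two conversions. */
TickType_t xPlainTicks(uint32_t ulTimeInMs)
{
    return pdMS_TO_TICKS(ulTimeInMs);
}

TickType_t xMinTicks(uint32_t ulTimeInMs)
{
    return pdMS_TO_MIN_TICKS(ulTimeInMs);
}
```

With a 200 Hz tick, any period shorter than 5 ms truncates to 0 ticks in the plain conversion, while the clamped version yields 1 tick. At 1000 Hz both give the same result for any period of 1 ms or more.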

tselmeci wrote on Wednesday, August 24, 2016:

Hello Hein!

No, my issue has nothing to do with the tick frequency of FreeRTOS. Since we’ve figured out that my TCP problems occur due to the 200Hz setting, I’m using FreeRTOS at 1000Hz; the problem I described above is really independent of this.

rtel wrote on Wednesday, August 24, 2016:

An updated Labs release was uploaded yesterday. This new release is a maintenance release for existing code, rather than a release of the main development branch.

richard_damon wrote on Wednesday, August 24, 2016:

I don’t know if this is the case here, but this sounds very much like the problem that happens with a delay-until loop when it falls behind. It wants to do extra work right now to catch up, but that can cause starvation of lower priority tasks.
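To illustrate the delay-until effect, here is a small host-side model. It is only a sketch: both functions are hypothetical helpers, not FreeRTOS APIs, and the model assumes that the work done in each iteration takes no time at all.

```c
#include <stdint.h>

typedef uint32_t TickType_t; /* stand-in for the FreeRTOS type */

/* Ticks a delay-until style wait would block for. Returns 0 when the
 * next deadline has already passed. Unsigned subtraction keeps the
 * comparison valid across a tick counter wrap. */
TickType_t xTicksUntilNextWake(TickType_t xLastWakeTime, TickType_t xPeriod, TickType_t xNow)
{
    TickType_t xElapsed = xNow - xLastWakeTime;
    return (xElapsed >= xPeriod) ? (TickType_t)0 : (TickType_t)(xPeriod - xElapsed);
}

/* Number of back-to-back iterations that run without blocking before
 * the loop has caught up with the current time. */
unsigned uxIterationsWithoutBlocking(TickType_t xLastWakeTime, TickType_t xPeriod, TickType_t xNow)
{
    unsigned uxCount = 0;

    if (xPeriod == 0) {
        return 0; /* a zero period would never catch up; not modelled */
    }
    while (xTicksUntilNextWake(xLastWakeTime, xPeriod, xNow) == 0) {
        xLastWakeTime += xPeriod; /* the deadline advances by one period */
        uxCount++;
    }
    return uxCount;
}
```

For example, a loop with a period of 10 ticks that last woke at tick 0 and gets to run again at tick 35 executes three iterations with no blocking in between, and only then waits for the remaining 5 ticks. During those iterations no lower-priority task runs.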

heinbali01 wrote on Wednesday, August 24, 2016:

Tamás, are you maybe using FreeRTOS_select() ?

If you let select() wake up on a WRITE condition (you may write to a socket), then select() will keep on returning without blocking until the output buffer is full.
The same goes for the other conditions: if it unblocks because a socket has an exception (connection closure) and you omit to handle that exception, the next select() will keep on returning without sleeping.

heinbali01 wrote on Wednesday, August 24, 2016:

Another question:

When prvCalculateSleepTime() returns zero, can you say which timer had expired?

Was it ARP, DHCP, TCP, or DNS?

If it was TCP, it looks much like the frequency problem.
Have you tried this solution already?

tselmeci wrote on Friday, August 26, 2016:

No, I’m not using FreeRTOS_select(…) at all. Only simple reads and writes on the socket, which has ~100-200ms timeout set.

I can’t tell you whether it was the ARP, DHCP, etc. timer, since I haven’t started looking into it.

My machine is running with a 1000Hz tick; according to our earlier conversation by email, I thought I wouldn’t have to fix the pdMS_TO_TICKS(…) issue (it may return 0) as long as the tick rate is 1000Hz.

So do you still recommend that I apply this small fix even though FreeRTOS is running at 1000Hz on my machine? Anyway, since I applied my primitive fix, the system appears to be stable in my current usage scenario…

heinbali01 wrote on Friday, August 26, 2016:

If you use a tick rate of 1000 Hz, it is indeed unlikely that applying the pdMS_TO_MIN_TICKS macro will change anything about the problem.

It would still be interesting to know for what protocol (ARP, DHCP, TCP, or DNS) xMaximumSleepTime was set to zero.

heinbali01 wrote on Wednesday, August 31, 2016:

After some further checking, we found a solution to the problem that Tamás reported above: the starvation of lower-priority tasks.

Please change the following static-inline function in include/FreeRTOS_IP_Private.h :

    static portINLINE UBaseType_t uxGetRxEventCount( void )
    {
-        extern volatile UBaseType_t uxRxEventCount;
-        return uxRxEventCount;
+        return 0u;
    }

In other words, let this function always return zero.

Rationale :

When +TCP was first developed, we thought it would be advantageous to constantly keep track of how many RX packets are queued-up for the IP-task. It would allow the IP-task to give priority to RX processing above sending TCP packets.

The internal variable uxRxEventCount would keep track of the number of RX packets queued in xNetworkEventQueue.

The starvation: as long as uxRxEventCount was non-zero, the IP-task wouldn’t block. xTCPTimerCheck() in FreeRTOS_Socket.c would return 0 ticks:

    if( uxGetRxEventCount() != 0u )
    {
        /* This was interrupted, but want to be called as soon as
        possible to finish checking the other sockets. */
        xShortest = ( TickType_t ) 0;
        break;
    }

The above code is not needed at all: as long as the xNetworkEventQueue is non-empty, the IP-task won’t block unless another task has work to do at a higher priority.
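The effect can be shown with a deliberately simplified model. This is not the actual +TCP code: both functions are hypothetical and reduce the sleep-time decision to the RX counter and the shortest pending timer.

```c
#include <stdint.h>

typedef uint32_t TickType_t;      /* stand-in for the FreeRTOS type */
typedef unsigned long UBaseType_t; /* stand-in for the FreeRTOS type */

/* Old behaviour (simplified): a non-zero RX event count forces a sleep
 * time of 0 ticks, whatever the timers say. */
TickType_t xModelSleepTimeOld(UBaseType_t uxRxEventCount, TickType_t xShortestTimer)
{
    if (uxRxEventCount != 0u) {
        return (TickType_t)0;
    }
    return xShortestTimer;
}

/* Patched behaviour (simplified): the counter is ignored, which is
 * equivalent to uxGetRxEventCount() always returning 0. */
TickType_t xModelSleepTimeNew(UBaseType_t uxRxEventCount, TickType_t xShortestTimer)
{
    (void)uxRxEventCount;
    return xShortestTimer;
}
```

In this model, as long as the counter stays non-zero the old variant keeps returning 0 ticks, so the IP-task never blocks and lower-priority tasks such as a watchdog task are starved. The patched variant depends only on the timers.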

In the latest release of +TCP, you will see in FreeRTOS_TCP_IP.c that uxGetRxEventCount() won’t be checked any more. That was thanks to Andrzej Burski, who noted that under some circumstances, the IP-task just stops sending TCP packets.

Tamás, thanks for reporting the above. In the next release, I think that uxRxEventCount will have disappeared altogether :slight_smile:

rtel wrote on Wednesday, August 31, 2016:

A 160831 release is now available with all use of the uxRxEventCount
variable removed.

tselmeci wrote on Thursday, September 01, 2016:

I can confirm that this little modification has solved my issues :slight_smile: