Missing a semaphore under load

dspaude wrote on Wednesday, October 10, 2007:

I have found a problem where my system is missing the execution of a task due to missing the signaling of a semaphore that would allow the task to continue. Per suggestions on this forum I am using semaphores for all of my interrupts because resuming/suspending isn’t the best for events that may trigger while the task is executing. So, in my interrupt service routine I call xSemaphoreGiveFromISR() with the semaphore queue handle and in the task I wait for the semaphore using xSemaphoreTake() with portMAX_DELAY.

Here is what I see when the system is idle or under light load where I get an interrupt every 16 ms:
[00, 00:01:41.645]  xhfc_ISR            (interrupt event)
[00, 00:01:41.645]  xhfc_ISR: S_XHFC    (semaphore signaled from ISR)
[00, 00:01:41.645]  ** ENTER XHFC **    (task receives semaphore signal)
[00, 00:01:41.645]  ** EXIT XHFC **     (task waits for another semaphore)
[00, 00:01:41.660]  xhfc_ISR
[00, 00:01:41.660]  xhfc_ISR: S_XHFC
[00, 00:01:41.660]  ** ENTER XHFC **
[00, 00:01:41.660]  ** EXIT XHFC **
[00, 00:01:41.675]  xhfc_ISR
[00, 00:01:41.675]  xhfc_ISR: S_XHFC
[00, 00:01:41.675]  ** ENTER XHFC **
[00, 00:01:41.675]  ** EXIT XHFC **
[00, 00:01:41.690]  xhfc_ISR
[00, 00:01:41.690]  xhfc_ISR: S_XHFC
[00, 00:01:41.690]  ** ENTER XHFC **
[00, 00:01:41.690]  ** EXIT XHFC **

However, under load this is what I see (interrupt is still every 16 ms):
[00, 00:01:41.995]  xhfc_ISR
[00, 00:01:41.995]  xhfc_ISR: S_XHFC
[00, 00:01:41.995]  ** ENTER XHFC **
[00, 00:01:41.995]  ** EXIT XHFC **
[00, 00:01:42.010]  xhfc_ISR
[00, 00:01:42.010]  xhfc_ISR: S_XHFC
[00, 00:01:42.025]  xhfc_ISR
[00, 00:01:42.025]  xhfc_ISR: S_XHFC
[00, 00:01:42.030]  ** ENTER XHFC **

The OS missed executing the task which in my case causes a receiver overrun. There is only one task that has the same priority as this task and that task is not receiving much data. However, there are many other lower-priority tasks that are very busy.

I am using the 4.5.0 version of FreeRTOS and I am going to try 4.4.0 because I noticed there were some changes to the semaphore/queue code. Any other tips are appreciated, though.

davedoors wrote on Thursday, October 11, 2007:

Maybe only an indirect response to the question, but how are you generating this nice trace output?  I’m interested because first it looks very useful and second because maybe it has something to do with your problem?

dspaude wrote on Thursday, October 11, 2007:

The trace output is part of a logging system we have that stores strings, numbers, streams of bytes, etc. to a specific RAM location for later retrieval (with a timestamp added upon message storage). It isn’t very complex and it is very useful.

The trace messages do not cause the problem because I added those messages in my tracking down of the problem. Normally there are no trace messages output unless there is an error or catastrophe.

I did try FreeRTOS 4.4.0 yesterday and that version did NOT miss executing the task. The task WAS delayed by about 5 ms and did result in an overrun, but at least it didn’t miss executing the task. 16 ms is a lot of time between tasks and this particular task (XHFC) takes up to 1.5 ms to execute, so well under the limit (and my timer resolution for task switching is 5 ms, but there is only one other task at the same priority (none higher) and that task is rarely getting traffic in this particular test).

Anyway, might there be an issue with the semaphore system in 4.5.0?

rtel wrote on Thursday, October 11, 2007:

>Anyway, might there be an issue with the semaphore system in 4.5.0?

I cannot say for sure - but the changes between the two versions were more additions rather than changing the way the existing implementations worked - so I doubt there would be any changes that could cause this (especially if you have configUSE_MUTEXES set to 0).

Which port and tools are you using?   Are you able to put together a small demo that demonstrates this behaviour?

Regards.

chaalar wrote on Thursday, October 11, 2007:

One point caught my attention in your trace output…


[00, 00:01:42.010] xhfc_ISR
[00, 00:01:42.010] xhfc_ISR: S_XHFC
[00, 00:01:42.025] xhfc_ISR #1
[00, 00:01:42.025] xhfc_ISR: S_XHFC
[00, 00:01:42.030] ** ENTER XHFC ** #2

We see that a semaphore missed in #1. But in #2 we see
that it takes 5 msec for your task to receive the semaphore.
However, in light load it case it receives the semaphore within a msec.
Is it possible that your interrupt does not causing a context switching? One may ask why
it’s not happening in the first case, but I suspect that your interrupt has been
signalling (most of the time) just before your tick timer interrupt(which is 5 msec)
which gives me the idea that your ISR does not cause a context switch.

Yusuf Caglar AKYUZ

dspaude wrote on Thursday, October 11, 2007:

I am using the AT91SAM7S port with GCC 4.2.1. Since this problem appears related to testing the whole system I don’t believe I can generate a simple test application for you. In reality I am using the AT91SAM7SE with external SDRAM and we use a lot of memory (8 MB available and much of it used) and a lot of code (over 400 KB).

With the timestamp it isn’t uncommon for the interrupt time to indicate it was 5 ms before the task executed. In reality the timestamp could have changed right before the task executed and stored the trace message (meaning there could have been between a few microseconds up to almost 10 ms between the ISR and the task depending on where in the timing the trace messages were recorded).

However, I do agree that there is some lag when the system is dealing with more data than during the idle period. 16 ms is a long time, though, and for a task of the highest priority it should have executed well before the next interrupt (so in this case it was 32 ms later that the task finally executed). That is why I think the task failed to receive the semaphore (should have executed within 2 ms at the most, not 32 ms). The XHFC task itself only takes about 1.5 ms at the most. There are many other tasks being handled, but at a lower priority (-2 compared to the XHFC task).