xSemaphoreTake fails before timeout

nobody wrote on Tuesday, February 14, 2006:

Hi,

I’m having a problem that I’ve traced to xSemaphoreTake returning failure right away, despite the fact that the semaphore is initialized properly and I tell it to block a long time.

The weird thing is that the following code works:

while(xSemaphoreTake(sem, portMAX_DELAY) != pdTRUE)
  ; //do nothing

but this code occasionally returns failure right away:

xSemaphoreTake(sem, portMAX_DELAY);

It seems like a race condition because it only fails once in a while, at unpredictable times.

I’m using this semaphore as a lock on a data structure.  The problem only occurs when I have another task running that receives frequently (faster than 1000Hz) on a different queue.  The messages on the different queue are posted from an ISR with xQueueSendFromISR.

Any ideas?

nobody wrote on Wednesday, February 15, 2006:

Which processor and compiler are you using?  If the while loop version works all the time I would have thought it unlikely that a race condition is the cause as the xSemaphoreTake works in exactly the same way.

A small test program that demonstrates the problem would be good.

nobody wrote on Wednesday, February 15, 2006:

I’m using the IAR tools with an at91SAM7S256. 

I created a pretty simple test that demonstrates the error.  After I start the scheduler, my main thread makes the following calls:

vSemaphoreCreateBinary(mux); //mux is a global xSemaphoreHandle

  xTaskCreate(thread, "t1", DEFAULT_STACK_SIZE, NULL, NORMAL_PRIORITY, ( xTaskHandle * ) NULL );
  xTaskCreate(thread, "t2", DEFAULT_STACK_SIZE, NULL, NORMAL_PRIORITY, ( xTaskHandle * ) NULL );

Here’s the code for each thread:

portTASK_FUNCTION( thread, pvParameters )
{
  volatile unsigned int failCount = 0;
  volatile unsigned int successCount = 0;

  while(1)
  {
    if(xSemaphoreTake(mux, portMAX_DELAY) != pdTRUE)
    {
      failCount++;
    }
    else
    {
      successCount++;
      xSemaphoreGive(mux);
    }
  }
}

So I just have two threads taking and giving a semaphore constantly.  Using my debugger, I found that after a few seconds, successCount was 70890 and failCount was 104 in one of the threads.

My understanding of the API is that with the block time set to portMAX_DELAY, xSemaphoreTake will never return without successfully acquiring the semaphore, but there it is.  The fact that it fails less than 1% of the time makes me think there is some sort of race going on.

rtel wrote on Thursday, February 16, 2006:

I can replicate this on the PC port and will investigate further.

rtel wrote on Thursday, February 16, 2006:

Could you try the following:

+ Go to function xQueueSend() in queue.c.
+ Locate the comment text "When we are here it is possible that we unblocked" half way through the function.
+ Four lines above this text is a call to taskEXIT_CRITICAL().  Comment this line out.
+ Immediately below the end of the comment is a call to taskENTER_CRITICAL().  Comment this line out.

+ Go to function xQueueReceive() in queue.c.
+ Do the same here - comment out the EXIT/ENTER critical that are the 32nd and 26th lines in the function.

See if this fixes your problem.

Regards.

nobody wrote on Thursday, February 16, 2006:

32nd and 36th lines into the function (not 32nd and 26th)

rtel wrote on Thursday, February 16, 2006:

Scrub that - its completely wrong.

I will take another look.

rtel wrote on Thursday, February 16, 2006:

Ok, this is what I’m seeing.

Take two tasks t1 and t2 that run the sample code you supplied.

1) t1 attempts to get the semaphore but cannot so blocks with block time portMAX_DELAY.

2) t2 runs and gives the semaphore.  Giving the semaphore causes t1 to unblock.  HOWEVER t1 and t2 are the same priority so no context switch occurs.

3) t2 continues to execute.  In the tight loop it immediately gets the semaphore again. 

4) Now there is a context switch.  t1 was unblocked when t2 gave the semaphore so now executes.  However, since it unblocked t2 has taken the semaphore again so t1 returns pdFAIL to indicate that it could not successfully obtain the semaphore even though it unblocked.

It appears that t2 is unblocking too early, whereas in fact it unblocked at the correct time, its just that other things happened in between.

If t1 was of higher priority it would of executed prior to t2 taking the semaphore again.

Does this explain what you are seeing in your your application as well as the test code you provided?

Regards.

nuncio wrote on Thursday, February 16, 2006:

So therefore a possible solution would be to add something to the tight loop (in either task), such as a call to taskYIELD or vTaskDelay.  This would force a context switch.
Does that sounds about right?

nobody wrote on Thursday, February 16, 2006:

Thanks for such a quick response!

This does indeed explain what I’m seeing.  A partial workaround would be to keep attempting to take the semaphore until it returns pdTRUE, but I don’t think that really fixes the problem.

In my program, I have tight loops around a semaphore used to lock a device, and I need the two threads to alternate access.  I think with the current semaphore policy, one thread may get to take the semaphore many times before the other succeeds in taking it.

Perhaps the right fix is to make t1 get the semaphore at step 2 instead of just being unblocked.  Then t2 can continue to execute (no context switch) as before, but it has to stop when it tries to take the semaphore, since t1 is holding it.  At that point t1 gets to run.

What do you think?

Thanks again for figuring this out so fast!

nobody wrote on Thursday, February 16, 2006:

Well, that would fix it, but I still think the semaphore isn’t behaving the way the API says it does.  It shouldn’t be possible for the semaphore to return pdFAIL before the timeout period is up.

nobody wrote on Thursday, February 16, 2006:

Not sure I agree with this statement.  The task unblocks when it should.  The API has no regard for how you choose to prioritise tasks within your application.  But then again…

An alternative change you could make should you wish to change the behaviour is to cause a context switch when a task unblocks if the unblocked task has a priority >= the currently running task.  Currently a context switch will only occur if the unblocked task has a priority > than the currently running task.

nobody wrote on Thursday, February 16, 2006:

Well, you would have to change to API to say that xSemaphoreTake returns pdFALSE after xBlockTime *or less* if it fails to acquire the semaphore.  But I see your point.

My opinion is that it’s ok for the API to not specify which task gets unblocked when a semaphore is given, but the ones that don’t get the semaphore shouldn’t unblock.  It should unblock with pdTRUE or not at all.

colin_howells wrote on Friday, April 07, 2006:

Hi, I noticed this problem yesterday and tried to fix it by placing a loop in xQueueReceive, I just keep looping until the wait time has expired or the queue has a message posted:

signed portBASE_TYPE xQueueReceive( xQueueHandle pxQueue, void *pcBuffer, portTickType xTicksToWait )
{
signed portBASE_TYPE xReturn;
portTickType xTicksAtStart;
portTickType xTicksBlockedFor;

   /*
    work around for the race condition
   */
  xTicksAtStart= xTaskGetTickCount();
  xTicksBlockedFor = 0;
    /* This function is very similar to xQueueSend().  See comments within
    xQueueSend() for a more detailed explanation.

    Make sure other tasks do not access the queue. */
    vTaskSuspendAll();

    /* Make sure interrupts do not access the queue. */
    prvLockQueue( pxQueue );

    /* If there are no messages in the queue we may have to block. */
    while( prvIsQueueEmpty( pxQueue ) && xTicksBlockedFor < xTicksToWait )
    {
        /* There are no messages in the queue, do we want to block or just
        leave with nothing? */           
        if( (xTicksToWait - xTicksBlockedFor) > ( portTickType ) 0 )
        {
            vTaskPlaceOnEventList( &( pxQueue->xTasksWaitingToReceive ), xTicksToWait - xTicksBlockedFor );
          xTicksBlockedFor = xTaskGetTickCount()-xTicksAtStart;

It does “seem” to work although I’m sure if pre-emption is enabled the fault can still occur i.e. if a tick interrupt happens between the while loop and the next critical region causing a context switch.
  Apart from the above I think i’ve been a bit naive and have probably missed something blindingly obvious.

imajeff wrote on Tuesday, April 11, 2006:

Then it sounds to me like what you need is for each task to yield immediately after it gives the semaphore.