Bug? configASSERT ( xInheritanceOccurred == pdFALSE ) in xQueueSemaphoreTake

Erlkoenig90 · October 5, 2022, 4:43pm

Hi, I get semi-random crashes on an ESP32S3 dual-core platform (ESP-IDF v4.4.2) with the following message:

assert failed: xQueueSemaphoreTake queue.c:1624 (xInheritanceOccurred == ( ( BaseType_t ) 0 ))

This happens rather irregularly, but always on the semaphore protecting the logging function (see log_freertos.c in ESP-IDF). This semaphore is taken very frequently from various threads. In most cases, it is returned quickly as the logs are disabled, but the semaphore is needed to query enabled/disabled state. The tasks are running on 2 cores. I’m having difficulty reproducing this in a minimal example, it only happens in a complex code base that interacts via USB with a PC software; I suspect the particular timing induced by the USB communication triggers this.

The code in xQueueSemaphoreTake looks like this:

/* For inheritance to have occurred there must have been an
 * initial timeout, and an adjusted timeout cannot become 0, as
 * if it were 0 the function would have exited. */
#if ( configUSE_MUTEXES == 1 )
	{
		configASSERT( xInheritanceOccurred == pdFALSE );
	}
#endif /* configUSE_MUTEXES */

However, I think that comment is wrong: If the thread currently holding the semaphore has lower priority than the current one, priority inheritance is performed and xInheritanceOccurred is set to pdTRUE. If the other thread does not release the semaphore yet, we run into the timeout (queue.c:1719). The queue is then unlocked (on ESP: spinlock is released). In this precise moment, the other thread might release the mutex. Therefore, the return (queue.c:1758) does not happen. Instead, the outermost loop is restarted, but xTicksToWait is now 0 but xInheritanceOccurred is still pdTRUE.

I think this assertion should simply be removed, as the reasoning behind it is wrong (“For inheritance to have occurred there must have been an initial timeout, and an adjusted timeout cannot become 0, as if it were 0 the function would have exited” - not true, the loop can also restart if the semaphore was just returned). Instead, we should just return errQUEUE_EMPTY (queue.c:1632).

What do you think about this reasoning? Is this indeed a bug or is something else wrong?

The line numbers refer to ESP-IDF v4.4.2 components/freertos/queue.c, sorry I can’t post links or I’d have linked them…

skptak · October 10, 2022, 8:33pm

Hey Niklas, thanks for creating your first post on the forum! Sorry for the delay in getting back to you, but I’d like to try and help you with your issue.
Since it seems you’ve put some extensive thought into this I was wondering if you had tried removing that configASSERT( xInheritanceOccurred == pdFALSE ); call and if that removed the issue you are facing. Furthermore, if it did, did you notice any other problems occur?
I’ve reached out to a member of the kernel team to look further into this and hope to have a response for you shortly.
Thanks for bringing this to our attention!

Erlkoenig90 · October 11, 2022, 8:27am

Hi, thanks for your reply.

Yes, removing that configASSERT( xInheritanceOccurred == pdFALSE ) worked fine. I replaced it with

if (xInheritanceOccurred != pdFALSE) {
  __asm__ volatile ("nop");
}

and placed a breakpoint on the nop. The breakpoint gets triggered, but the application continues to work fine. So I think that the assertion can be safely removed.

rtel · October 11, 2022, 2:29pm

Can you please describe a valid step by step sequence that incorrectly triggers that assert so I can understand better. Your original post mentions a sequence, but it’s not clear to me. Thanks.

Erlkoenig90 · October 12, 2022, 8:36am

Hi Richard, here’s my attempt at an execution sequence:

Thread A, priority 4, runs on Core 0.
Thread B, priority 5, runs on Core 1.
Thread A acquires the semaphore successfully.
Thread B attempts to acquire the semaphore with a call to xQueueSemaphoreTake and specifies a timeout, e.g. 10.
Thread B can’t acquire the semaphore yet, and goes to line 1638 and then checks for timeout in 1661.
Timeout hasn’t elapsed yet, so priority inheritance is performed (1677) and xInheritanceOccurred is set to true. Thread B then yields (1696).
Thread A releases the semaphore.
Thread B is awaken from the yield, and begins to continue the outer loop (1562).
Thread A re-acquires the semaphore before B enters the loop body.
Thread B sees that xTicksToWait isn’t zero yet, and continues with checking the timeout (1661).
The timeout has now elapsed, so xTicksToWait is set to 0 and Thread B continues with 1720.
Thread A releases the semaphore before Thread B continues
Thread B sees the queue count is 1 (in 1731), so doesn’t return but goes to 1762 and restarts the loop.
While Thread B starts from the beginning, Thread A re-acquires the semaphore again.
xTicksToWait is 0, so Thread B goes to 1624 and the assertion is triggered.

It is pretty convoluted and can only happen this way on a multi-core system.
Sorry if I missed something, it’s my first time using FreeRTOS and I don’t really have another explanation why the assertion is triggered.

aggarg · October 12, 2022, 11:23am

To make it more readable, let us consider the following pseudo code -

 1:  for ( ;; )
 2:  {
 3:      taskENTER_CRITICAL();
 4:      {
 5:          if( semaphore is available )
 6:          {
 7:              grab it and return
 8:          }
 9:          else
10:          {
11:              if( xTicksToWait == 0 )
12:              {
13:                  configASSERT( xInheritanceOccurred == pdFALSE );
14:              }
15:          }
16:      }
17:      taskEXIT_CRITICAL();
18:  
19:      vTaskSuspendAll();
20:      if( timed out has not yet expired )
21:      {
22:          if( semaphore is not available )
23:          {
24:              priority inheritance - check and set xInheritanceOccurred
25:              wait for semaphore to become available
26:              yield
27:          }
28:          else
29:          {
30:              xTaskResumeAll();
31:              attempt to take the semaphore again in the next iteration of the loop.
32:          }
33:      }
34:      else
35:      {
36:          /* timed out. */
37:          xTaskResumeAll();
38:  
39:          if( semaphore is not available )
40:          {
41:              return errQUEUE_EMPTY
42:          }
43:          else
44:          {
45:              attempt to take the semaphore again in the next iteration of the loop.
46:          }
47:      }
48:  }

The sequence of events is as follows:

Thread A acquires the semaphore successfully.
Thread B attempts to acquire the semaphore with timeout 10.
Thread B sees that the semaphore is not available at line #5. It sees that xTicksToWait is not zero at line #11, and evaluates the timeout at line #20. The timeout has not expired and semaphore is not available (checks at line #20 and #22). Priority inheritance is performed and thread B yields at line #26.
Thread A releases the semaphore.
Thread B is awaken from the yield, and enters the loop again at line #2.
Thread A re-acquires the semaphore before thread B enters the critical section at line #3.
Thread B sees that the semaphore is not available at line #5. It sees that xTicksToWait is not zero at line #11, and evaluates the timeout at line #20. The timeout has now expired, so xTicksToWait is set to 0 and Thread B continues to line #37.
Thread A releases the semaphore before Thread B executes the check at line #39.
Thread B goes to line #45 and reenters the loop.
Thread A re-acquires the semaphore thread B enters the critical section at line #3.
Thread B sees that the semaphore is not available and xTicksToWait is zero and the assertion triggers.

This seems reasonable to me and I think we should remove the assertion.

rtel · October 12, 2022, 3:16pm

Thanks for this excellent information - appreciate the effort put into the explanation.

One part I’m not sure about is what happens between numbers 5 and 6: If Thread B has the higher priority, then how can Thread A take the semaphore again before B?

[edit: A should disinherit the higher priority in step 4]
[edit 2: There could of course be another task that takes the semaphore, even if task A can’t - so the conclusion that the assert isn’t required probably holds regardless. I think the assert is more to check for logic errors during testing, so this probably highlights a testing gap too]

aggarg · October 12, 2022, 3:34pm

We are talking about multi-core system here.

rtel · October 12, 2022, 7:58pm

Ah, makes sense then. Thanks.

aggarg · October 13, 2022, 5:33am

@Erlkoenig90 Would you like to raise a PR to remove this assert or would you like us to do that?

Erlkoenig90 · October 13, 2022, 7:35am

I filed a pull request #576. Thanks for looking into this!

aggarg · October 13, 2022, 8:27am

Thank you for your contribution.