Assert due to scheduler suspended in timer callback invoked from prvSwitchTimerLists()

dlm2112 wrote on Thursday, March 22, 2018:

I’m running FreeRTOS 8.2.3 (on a Xilinx Zynq), but have looked through code for 9.0.0 and 10.0.1, and the condition I’ll outline seems to exist in all of those versions.

I’m pretty sure that it’s an issue with the timer code (timer.c), but I can certainly be convinced otherwise if I’m doing something wrong!

The general scenario is:

  1. Start a recurrent timer (10 ms interval, with the tick rate set to 100,000 ticks per second in the BSP). The timer callback implementation is relatively short, but does grab a mutex via calls to xSemaphoreTake().
  2. There is at least one other recurrent timer that runs with a longer interval (250 ms) and certainly longer durations than 10 ms. It has its own calls to other mutexes, also using xSemaphoreTake() (although I don’t think that matters).
  3. The issue is an assert being raised from the 10 ms recurrent timer (only after about 12 hours of running) when calling xSemaphoreTake() as part of mutex locking. Specifically, the assertion is from queue.c:1375:
    configASSERT( !( ( xTaskGetSchedulerState() == taskSCHEDULER_SUSPENDED ) && ( xTicksToWait != 0 ) ) );

The part of the condition that fails is that the scheduler is suspended.

The call stack (from v 8.2.3) is:
vApplicationAssert() portZynq7000.c:206
xQueueGenericReceive() queue.c:1375
local timer callback code here…
prvSwitchTimerLists() timers.c:744
prvSampleTimeNow() timers.c:540
prvProcessTimerOrBlockTask() timers.c:457
prvTimerTask() timers.c:437

What I’m noticing is that the scheduler is stopped via vTaskSuspendAll() in prvProcessTimerOrBlockTask(), and is still suspended when my callback is invoked at timers.c:744. Ultimately, when my timer callback attempts to grab a mutex via xSemaphoreTake(), the assertion happens in xQueueGenericReceive() because the scheduler isn’t running.
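The path described above can be sketched as a small host-side model. This is not the FreeRTOS source: the names (`model_timer_task_pass`, `model_semaphore_take`, `user_timer_callback`) are hypothetical, and the only things modelled are the suspend/resume ordering around the callback and the condition from the quoted configASSERT().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; these are NOT the FreeRTOS definitions. */
typedef enum { SCHED_RUNNING, SCHED_SUSPENDED } sched_state_t;

static sched_state_t g_sched = SCHED_RUNNING;
static bool g_assert_tripped = false;
static uint32_t g_callback_block_time = 0;

/* Mirrors the condition in the quoted configASSERT(): asking to block
   (ticks_to_wait != 0) while the scheduler is suspended is flagged. */
static void model_semaphore_take(uint32_t ticks_to_wait)
{
    if (g_sched == SCHED_SUSPENDED && ticks_to_wait != 0) {
        g_assert_tripped = true;
    }
}

/* Hypothetical user callback that takes a mutex with some block time. */
static void user_timer_callback(void)
{
    model_semaphore_take(g_callback_block_time);
}

/* One pass of the timer task, reduced to its scheduler handling.
   Returns true if the modelled assert would have fired. */
bool model_timer_task_pass(bool tick_overflowed, uint32_t block_time)
{
    g_assert_tripped = false;
    g_callback_block_time = block_time;

    g_sched = SCHED_SUSPENDED;           /* stands for vTaskSuspendAll() */
    if (tick_overflowed) {
        /* List-switch path: the old list is drained, so the callback
           runs while the scheduler is still suspended. */
        user_timer_callback();
        g_sched = SCHED_RUNNING;         /* stands for xTaskResumeAll() */
    } else {
        g_sched = SCHED_RUNNING;         /* resumed first ... */
        user_timer_callback();           /* ... then the callback runs */
    }

    assert(g_sched == SCHED_RUNNING);    /* every pass ends resumed */
    return g_assert_tripped;
}
```

In this model, a non-zero block time is only flagged on the overflow path, and a block time of 0 is never flagged, which matches the behaviour reported here.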

So, my questions are:

  1. Are there restrictions that I may be missing (e.g. don’t grab a mutex, or do anything with queues, from within a timer callback)?
  2. Is this a bug in the timer code, wherein the scheduler should (possibly?) be resumed before calling the callback from timers.c:744 and then suspended again afterwards?

FWIW, I’m able to get around this by changing the implementation of my (offending) timer to a task, but the questions remain.

Thanks in advance for thoughts/insight!

  • dave

rtel wrote on Friday, March 23, 2018:

Apologies for the delay in replying - I never received an email notification of this post (because I allowed my email to become full, doh!).

Do you specify a block time when you take the mutex? There is a restriction in timer callbacks that you must never try and block. That is primarily because all timer callbacks execute from the same task context.
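As a rough illustration of a callback that respects that restriction, here is a minimal sketch that takes the mutex with a block time of 0 and skips the period if the mutex is busy. The application names (`xSharedMutex`, `ulUpdates`, `ulSkipped`, `vFastTimerCallback`) are hypothetical. The section marked as stand-ins is not FreeRTOS code; it only exists so the sketch compiles off-target, and on a real target it would be replaced by including FreeRTOS.h, semphr.h and timers.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ---- Off-target stand-ins (NOT FreeRTOS source). On a real target,
   delete this section and include FreeRTOS.h, semphr.h, timers.h. ---- */
typedef int32_t  BaseType_t;
typedef uint32_t TickType_t;
struct FakeMutex { bool taken; };
typedef struct FakeMutex *SemaphoreHandle_t;
typedef void *TimerHandle_t;
#define pdTRUE  ((BaseType_t)1)
#define pdFALSE ((BaseType_t)0)

static BaseType_t xSemaphoreTake(SemaphoreHandle_t s, TickType_t xTicksToWait)
{
    assert(xTicksToWait == 0);   /* the stand-in cannot block */
    if (s->taken) {
        return pdFALSE;
    }
    s->taken = true;
    return pdTRUE;
}

static BaseType_t xSemaphoreGive(SemaphoreHandle_t s)
{
    s->taken = false;
    return pdTRUE;
}
/* ---- End of stand-ins ---- */

/* Hypothetical application state. */
SemaphoreHandle_t xSharedMutex;  /* assumed to be created elsewhere */
uint32_t ulUpdates = 0;          /* periods where the work was done */
uint32_t ulSkipped = 0;          /* periods where the mutex was busy */

/* Timer callback that never asks to block. */
void vFastTimerCallback(TimerHandle_t xTimer)
{
    (void)xTimer;

    if (xSemaphoreTake(xSharedMutex, 0) == pdTRUE) {  /* block time 0 */
        ulUpdates++;                 /* short work under the mutex */
        (void)xSemaphoreGive(xSharedMutex);
    } else {
        ulSkipped++;                 /* busy: retry on the next expiry */
    }
}
```

Skipping a period rather than spinning keeps the timer task from stalling on the mutex. If the work cannot be skipped, this pattern does not fit and the work probably belongs in a task instead.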

dlm2112 wrote on Friday, March 23, 2018:

Hi Richard,

Thanks for responding… I’ve been using FreeRTOS for a couple of years, now, and am generally finding it to work quite well. Thanks for the tool!

To answer your question, yes, I specify a non-zero timeout when acquiring the mutex. I didn’t realize there was a restriction on non-zero timeouts when taking a mutex in a timer callback, and I see (by code inspection) how making it a zero timeout would relieve the issue I’m seeing.

Since I really do want the mutex in that portion of code and don’t want a zero timeout, I think the best solution for me is to change that timer into a task (which I’ve already done).

Thanks for the clarification!

  • dave

richard_damon wrote on Friday, March 23, 2018:

Is it documented anywhere that timer callbacks are not supposed to block? I don’t see it.

Looking at the code, prvProcessTimerOrBlockTask() carefully resumes the scheduler before calling prvProcessExpiredTimer(), so most of the time blocking is fine. The issue can only occur during tick overflow processing, when prvSampleTimeNow() calls prvSwitchTimerLists() and the old list is drained before switching to the new list.
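A quick back-of-the-envelope check of how often that overflow path would be hit, using the 100,000 ticks per second from the original post. The 32-bit tick counter width is an assumption (it depends on the port and on configUSE_16_BIT_TICKS), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a free-running tick counter that is tick_bits wide wraps
   to zero when it is incremented tick_rate_hz times per second. */
double seconds_until_tick_overflow(unsigned tick_bits, uint32_t tick_rate_hz)
{
    assert(tick_bits >= 1 && tick_bits <= 32);
    assert(tick_rate_hz > 0);

    /* Number of distinct counter values: 2^tick_bits. */
    uint64_t counts = (uint64_t)1 << tick_bits;
    return (double)counts / (double)tick_rate_hz;
}

/* Same interval expressed in hours. */
double hours_until_tick_overflow(unsigned tick_bits, uint32_t tick_rate_hz)
{
    return seconds_until_tick_overflow(tick_bits, tick_rate_hz) / 3600.0;
}
```

Under those assumptions, hours_until_tick_overflow(32, 100000) is 2^32 / 100,000 s, or about 42,950 s, which is roughly 11.9 hours. That would be consistent with the assert showing up only after about 12 hours of running.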

While long blocks can cause issues with one timer blocking another, there shouldn’t be a fundamental problem with a timer function blocking (it isn’t like the case of the IdleHook function blocking).

One big issue with this restriction is that the same restriction would seem to be imposed on functions used with the PendFromISR functionality. One use of that functionality is when the response to an interrupt needs to do some processing that involves short blocks (like waiting on interrupt-driven I/O), which really shouldn’t be done inside an ISR.

rtel wrote on Friday, March 23, 2018:

Has always been thus. Last paragraph on this page: (and no doubt in the book too).

richard_damon wrote on Friday, March 23, 2018:

Ok, didn’t notice it there. Would have expected it in the API definition. Might be good to add a comment about that in the documentation for the timer callback function.

I would also disagree that it is ‘essential’ to avoid blocking in them. Yes, indiscriminate blocking can cause issues, but replacing the blocking with a busy wait just to avoid the blocking causes more issues. As in the example above, if a timer routine needs access to a shared resource and needs to grab a mutex to get it, grabbing the mutex with a 0 block time and busy waiting is apt to deadlock the system if the daemon task has higher priority than the task that holds the mutex (and at a minimum it stalls the timer task). If the timer callback could block on the mutex, the task holding it would get priority inheritance and run right away, possibly finishing quickly, and then letting the timer callback continue.

It really comes down to the latency requirements for your timers. If you can afford a bit of latency in your timers, you can do more in your timer callbacks and still meet those requirements, and this could include short blocking operations.

If you really intend that blocking should not be done in timer callbacks, I would move the resume all to after calling the timer callbacks so the trap will happen on ANY block, not just those on tick overflow, which is a rare occurrence. I think the inconsistency here is the problem.

I also do not quite understand why you need to suspend the scheduler here. I would think all the lists it is working on are ‘owned’ by the timer task and thus changes would be synchronous with it, and thus not need to be protected.

dlm2112 wrote on Wednesday, April 04, 2018:

Although I’m able to get around my “issue” by changing my timer to a task (which could arguably be the right thing to do anyway), I agree with Richard D that the inconsistency in behavior is what’s a tad troubling. In my particular example, the scheduler had been resumed “most of the time” by the time my timer fired, and thus the mutex lock was grabbed as expected and everything worked fine. It wasn’t until some other condition drove the internal list rearrangement that the assert was raised from the semaphore code.

As an enhancement, if there’s a chance that the scheduler can be re-enabled before invoking the timer callback (as noted in the original post), then that would be awesome. Of course, I’m not sure what additional issues may ensue within the FreeRTOS base code as a result, though… It would also be up to the user to recognize that a blocked timer could affect all other timers, since they all run in the context of the Timer task; but users need to know that now anyway, since only one timer callback will ever fire at a time. So, whether one timer is blocked by another because the first one is doing some OS-level blocking operation, or because the first timer is simply doing “too much”, is somewhat irrelevant.