Rare case of timer callback with scheduler disabled

jefftenney · May 19, 2025, 5:50pm

@richard-damon recently noted that the timer task might call a callback with the scheduler disabled. Looking at this code from the perspective of a tick-count rollover, it seems he is right. That behavior is probably not desirable, even though it’s a rare case where timers are being processed late and there’s a tick-count rollover and timer callbacks should be short by design and should never attempt to block anyway. If there is interest I would be happy to submit a PR to address this little issue. Here is the idea I had.

hs2 · May 19, 2025, 7:18pm

Looks good @jefftenney Thanks for your effort !
In my opinion it’s worth the PR to make the timers even more robust.

richard-damon · May 19, 2025, 8:40pm

I would need to check things more closely to be sure, but this change will make the calls to prvReloadTimer called in prfProcessExpiredTimer have the wrong value for xTimeNow. This might not matter, as we will catch up to the current time after we let the lists roll over.

jefftenney · May 19, 2025, 8:50pm

Thanks for looking at the idea. It’s essentially holding off the timer task’s recognition of the rollover. Turns out the idea wasn’t fully baked. This other usage of prvSampleTimeNow() isn’t compatible with holding off the rollover. It would allow the very thing forbidden by the comment above that line. A command would have a newer timestamp (post rollover) than the current acknowledged time in the timer task. No good.

I’m still happy to help, but I would need to put a little more time into it. I do still think it’s worth cleaning up so we never call a timer callback with the scheduler disabled.

richard-damon · May 19, 2025, 9:17pm

My thought, looking at the code is the question, why do we need to stop the scheduler here? My guess is that it may be to try to minimize (but can’ t eliminate) the possibility of time slipping between the getting of xTimeNow and the call to vQueueWaitForMessageResticted later in the code (that is a special Queue operation that is designed to be called with the scheduler suspended, that will resume it before blocking).

Instead, if we can fix the code to not need the scheduler disabled here, and I am not sure it does, but instead check just before calling the wait for message, what the time is NOW, and perhaps loop again if needed.

jefftenney · May 20, 2025, 12:57am

Interesting idea. Suspending the scheduler is important to the design, but here is a way to delay suspending the scheduler until after the call to prvSampleTimeNow(). Just another idea for now.

richard-damon · May 20, 2025, 1:33am

I’m not so sure that suspending the scheduler is that important. It seems it is just an attempt to keep the block time correct, but it actually doesn’t do it, as the tick interrupt can still happen and increase the tick, causing the block to go too late.

What would be better would be for the “restricted” function to instead of taking a blocking time, a time to block too. That or make it a real critical section, and not just a scheduler suspension. And actually, having versions of the blocking operations that take a timestamp of the time to block to, rather than a time period could be useful for similar cases in user code.

jefftenney · May 20, 2025, 3:49am

When the scheduler is suspended, OS ticks accumulate into an xPendedTicks field that is private to tasks.c instead of into the system tick count. That way the system tick count advances only when the scheduler can properly process each tick for task readiness.

As far as FreeRTOS APIs taking a tick count to delay until, I’m not sure we should do that. It would be difficult (impossible?) for the API implementation to decide if a tick time provided by the application is in the future or in the past.

aggarg · May 20, 2025, 5:33am

Very interesting discussion.

As @jefftenney explained, this cannot happen because the ticks are accumulated in xPendedTicks when the scheduler is suspended.

I wanted to point out this section in the documentation which states that:

Timer callback functions execute in the context of the timer service task. It is therefore essential that timer callback functions never attempt to block. For example, a timer callback function must not call vTaskDelay(), vTaskDelayUntil(), or specify a non zero block time when accessing a queue or a semaphore.

Given that we prohibit blocking in the timer callbacks, do we need this?

richard-damon · May 20, 2025, 1:12pm

aggarg:

I wanted to point out this section in the documentation which states that:
Timer callback functions execute in the context of the timer service task. It is therefore essential that timer callback functions never attempt to block. For example, a timer callback function must not call vTaskDelay(), vTaskDelayUntil(), or specify a non zero block time when accessing a queue or a semaphore.
Given that we prohibit blocking in the timer callbacks, do we need this?

I would say something needs to be fixed. The problem is that statement doesn’t explain why, and if you look at the various times blocking in Timer Callbacks gets discussed, the reason given isn’t the real reason, but is assumed to be because it interfears with other Timer Callbacks getting invoked in a timely manner. And in fact, the prohibition isn’t correct, as Callbacks CAN do those actions, if they first confirm that the scheduler is running, and they accept the other side effects of delaying the timer task.

As to:

This appears to be undocumented behavior, and the Timer task using the behavior, and a restricted function to implement the desired wait on a queue until a specified tick count (instead of blocking for a specified time) which is an operation that user code might want to do also, shows a slight weakness in the current definitions.

As to:

That is true only if you just pass a single tick count. What would need to be done is something like is done in xTaskCheckForTimeout, where we store a “full” tick count (that includes the roll over counter) and the time period, or perhaps have a function that can manipulate the TimeOut_t to point to the “future” time to wait for. Another option would to pass two tick counts, one being a base, that is known not to be in the future, and the second being a tick count known not to be in the past of that base. Given those, we can know if the second tick count is in the current future or not (sort of what xTaskDelayUntil does after adding the delay time to the previous time.)

richard-damon · May 21, 2025, 4:54pm

Thinking a bit of minimal changes, the source of the problem is in here:

 static TickType_t prvSampleTimeNow( BaseType_t * const pxTimerListsWereSwitched )
    {
        TickType_t xTimeNow;
        PRIVILEGED_DATA static TickType_t xLastTime = ( TickType_t ) 0U;

        xTimeNow = xTaskGetTickCount();

        if( xTimeNow < xLastTime )
        {
            prvSwitchTimerLists();
            *pxTimerListsWereSwitched = pdTRUE;
        }
        else
        {
            *pxTimerListsWereSwitched = pdFALSE;
        }

        xLastTime = xTimeNow;

        return xTimeNow;
    }

It is where the code calls prvSwitchTimerLists(), which will run the expired timer callbacks. when this is done inside the Scheduler suspension we get the problem.

This could be remove by testing the scheduler state before calling prvSwitchTimerList(), in if it is suspended, restore it. Then make the call, and if it was suspended, suspended it again, and then refetch the tick count.

jefftenney · May 21, 2025, 11:12pm

No, the proposed change doesn’t seem to be a need. The alternative is for the Timer API docs to indicate that callback functions might be called with the scheduler disabled. If we give the timer implementation that freedom, and publish it in the API docs, then we don’t need to change the code.

The most compelling reason to make the code change is probably to help applications that are currently “getting away with” blocking in a timer callback. Those applications are probably passing all of their tests and seem to function properly, but really there’s a subtle bug that would only show up at tick-count rollover, because the Timer task unexpectedly changes its behavior at tick-count rollover.

Since these applications are already doing something “wrong”, maybe we don’t have any obligation to try to help them out. And since that’s the most compelling reason I could think of, maybe a small documentation update would be better.

jefftenney · May 23, 2025, 7:22am

@richard-damon Thanks for looking at it so carefully and for your suggestions.

Here is what I finally ended up with. It’s pretty clean, but ultimately just a minor preference and not truly a need.

richard-damon · May 23, 2025, 4:49pm

My idea wasn’t to delay the timer list switch as your is doing, but just reenable the scheduler do the switch, the suspend the scheduler and refetch the time.

so instand of your

                xTimeNow = tmrMAX_TIME_BEFORE_OVERFLOW;
                *pxTimerListsWereSwitched = pdFALSE;

instead do:

                vTaskResumeAll();
                prvSwitchTimerLists(); 
                vTaskSuspendAll();
                xTimeNow = xTaskGetTickCount();

I think your method is like one that was proposed earlier, but it may have an issue with the other case where the function was called.

jefftenney · May 23, 2025, 6:03pm

I was able to resolve all the issues. I thought about the pros/cons and I came to prefer this:

    xTimeNow = tmrMAX_TIME_BEFORE_OVERFLOW;
    *pxTimerListsWereSwitched = pdFALSE;

over this:

    vTaskResumeAll();
    prvSwitchTimerLists(); 
    vTaskSuspendAll();
    xTimeNow = xTaskGetTickCount();
    *pxTimerListsWereSwitched = pdTRUE;

aggarg · May 26, 2025, 9:46am

If the pxCurrentTimerList happened to have more than one timers at the time of tick overflow, would it now take several iterations of timer task loop to process those?

Would you please share pros and cons too?

jefftenney · May 26, 2025, 6:26pm

Yes. Although, if we happen to process a command while handling the backlogged timers, then we clear the entire backlog while processing the command.

The pros/cons I considered were all based on code style. The biggest consideration is the optics of a function “breaking” the scheduler critical section of its caller and then attempting to resume it. If we allow prvSampleTimeNow() to do this, then we are allowing it to know why the caller is using a scheduler critical section in the first place. In that case, we should also allow the function to know that the caller doesn’t need the critical section if the lists are switched.

    xTaskResumeAll();
    prvSwitchTimerLists(); 
    vTaskSuspendAll();
    xTimeNow = xTaskGetTickCount();
    *pxTimerListsWereSwitched = pdTRUE;

would become this:

    xTaskResumeAll();
    prvSwitchTimerLists(); 
    xTimeNow = xTaskGetTickCount();
    *pxTimerListsWereSwitched = pdTRUE;

and the caller would know that prvSampleTimeNow() ended the critical section because it switched the lists.

And then it would become this:

    xTaskResumeAll();
    prvSwitchTimerLists(); 
    *pxTimerListsWereSwitched = pdTRUE;

because xTimeNow doesn’t need updating anymore.

And there’s more. The next logical step would be not to check if there are backlogged timers before ending the scheduler critical section. Switching the lists alone is sufficient to end the scheduler critical section.

In the end, I think it would end up something like this:

And I would probably prefer this one now.

Summarizing - two code solutions exist, one documentation-only solution exists, and we can also choose to do nothing. The two code solutions are (1) delaying recognition of rollover in the timer task and (2) letting prvSampleTimeNow() resume the scheduler.

richard-damon · May 26, 2025, 6:52pm

Remember, the function we are looking at changing is.a private function within the Timer Task code, and thus can “know” why that code was using the suspension.

The issue with your code that just does the xTaskResumeAll is that the path in question NEEDS the scheduler suspension so that it can compute a correct time to block, knowing that the xTimeNow is the current time now, which is might not be if the scheduler has been running. It also means that the Caller will also resume the scheduler, when the scheduler should no longer be suspended (unless you change that path).

This proposed changed would require the caller to not resume the scheduler.

jefftenney · May 27, 2025, 2:46pm

The scheduler suspension is not needed in the case of a timer list switch. You can review the code to see how the caller responds when prvSampleTimeNow() switches the lists. Then take a look at the github link and let me know if I’ve missed something.

richard-damon · May 27, 2025, 3:15pm

With the current code, if prvSampleTImeNow() says it switched, then yes, it doesn’t block, but it does resume the scheduler, which would be an error.

Currently, everywhere in the code, there are in close proximity Suspend and Resume calls, so a simple visual inspection can confirm no error in that nesting.

Your proposed method puts the resume for the suspend in a different function, and thus not clearly linked to each other. By resuspending, at least we can see at each level that balance has been maintained, even if in reverse order in one place.