Timer START_DONT_TRACE Race Condition?

Steve1 · April 1, 2021, 11:32pm

I’m working with an older version of FreeRTOS, v9.0.0, and have run across a problem similar to FreeRTOS Real Time Kernel (RTOS) / Bugs / #175 Timer to executes after deletion. In our case we have timers that get created for a short period of time that are auto-reloading and have short timeouts, 20 ms. We have been running into crashes as we exercise the system and they always link back to the prvProcessTimerOrBlockTask function, crashing specifically in the uxListRemove function. After some digging I found the cause to be the continued use of a deleted timer. Eventually my belief is the follow scenario is occurring:
a) While the timer task is servicing expired timers, a timer, say timer X, has had xTimerStop() and xTimerDelete() call on it, so these commands are added to the queue.
b) After the above two calls are made on timer X, the same timer is found to be expired, it’s callback routine is called.
c) Since the timer is auto-reloading, an attempt is made to add it back to the list, but due to the short timeout, it has expired again, so the START_DONT_TRACE command with timer X is added to the command queue.
d) When the command queue is service, timer X is stopped, deleted, and then the deleted timer pointer is put back in the timer list as an active timer.
e) At sometime after the delete is performed, the memory associated with the previous timer X is reused so the data within changes.
f) When the timer expires again, the system crashes de-referencing pointers that point to inappropriate places.
I have a quick fix for this which is to keep a local array (on the stack) of deleted timer pointers within the prvProcessReceivedCommands() function and anytime a START_DONT_TRACE command is received, check if the timer pointer matches a timer that was deleted in this execution of prvProcessReceivedCommands(). If a match is found, I break out of the START/RESET case so the stale timer pointer is not put back in the list. There is a possibility that the deleted timer memory would be reused immediately, thus I would be ignoring a valid timer, but I think this is a remote but possible case. The other thought I had was to add a magic number marker into the Timer_t structure and clear it during deletion, then this could be used to detect the case and avoid the crash. Again there are holes in this idea. The reason for this message is to let you know of the issue (it does not appear to be addressed in the latest timer.c code) and then ask for your suggestions on how to fix it. My quick fix is working and we are successfully avoiding the crashes we were seeing, but again, I’m not sure this is the best way to fix the issue, so was hoping for suggestions.
Steve.

jefftenney · April 2, 2021, 12:54am

How does configTIMER_TASK_PRIORITY compare to the priorities of the tasks that call the timer API functions?

Steve1 · April 2, 2021, 1:37am

#define configTIMER_TASK_PRIORITY ( 5 )

All other tasks in the system, except one, have priorities below 5. The one other task with a priority of 5 is a watchdog task which uses vTaskDelay for it waits, it does not use the timer APIs.

jefftenney · April 2, 2021, 2:12am

In that case, it would be unusual for the timer command queue to have more than one entry in use. Seems to me the scenario you describe requires multiple commands to be in the queue at the same time.

With your task priorities, you could end up with multiple commands in the queue if you call timer API functions from a timer callback function. Do you do that? Another way would be if you make timer API calls while the scheduler is disabled. Do you do that? EDIT: Another way is the FromISR() timer APIs. How about those?

Steve1 · April 2, 2021, 3:11am

We don’t disable the scheduler after it is started but we do cancel (stop then delete) timers in callback and on occasion start a different timer in a callback.

jefftenney · April 2, 2021, 3:49am

Sounds like in your case, another (maybe preferred) workaround would be not to stop/delete timers from a timer callback function. Instead, defer the stopping/deleting to some of your own task code.

richard-damon · April 2, 2021, 11:37am

The basic rule is you can’t delete a object (like at Timer) that has is still ‘in use’, i.e. has a reference. Issuing a Stop command in the callback delays the stop till the callback is done, so you can’t delete the timer then.

As Jeff says, the issue is deleting. Why not just stop it and then reuse it when you need it again rather than recreating it.

Steve1 · April 2, 2021, 11:58am

Thank you for your responses. We will take this into consideration as we decide how to move forward. Thanks again, Steve.

Steve1 · April 2, 2021, 12:17pm

One further question, is there not a race condition here in the singular stop case as well? Consider the same scenario except without the deletion. If the xTimerStop() command is called within the timer callback on a short timeout auto-repeating timer, could not the START_DONT_TRACE command be queued behind the STOP in the command queue and then the timer end up being restarted when the intent was to stop it?

richard-damon · April 2, 2021, 2:09pm

I would have to look at the code to see how it is handled, but at least in that case the Timer structure still exists so could record that the stop happened and the START_DONT_TRACE could see that the system changed and not restart. This get around the undefined behavior of using something that has been deleted, and moves you into just atomic race conditions.

jefftenney · April 2, 2021, 6:03pm

Hi Steve - yes I agree the issue does appear to affect both stop and delete. Deferring those to one of your own task’s contexts (instead of from the timer task’s context) should be a solid workaround for you.

Steve1 · April 5, 2021, 12:31pm

Hi Jeff, I understand your comment but it feels like a bit of a dodge to me. I can understand the “don’t delete a timer in it’s callback” comment, I’m with you there. The “don’t stop a timer in it’s callback” is a bit harder to understand. This feels much more like a timer bug. I suppose the alternative to calling it a bug is to call it out in the documentation as an unsupported behavior. I didn’t see it called out in the documentation, maybe I missed it. Both items would have been helpful to know when we started our project many moons ago. Just my two cents. Thanks, Steve.

jefftenney · April 5, 2021, 5:20pm

Just for clarity, deleting a timer from its callback function is normally fine, even if it’s an auto-reload timer. Similarly, stopping an auto-reload timer from its callback function is also normally fine.

However, if the timer task falls far behind in its work, auto-reload timers may not stop/delete properly. It seems that use case hasn’t gotten much attention in the code or docs. But it’s a strange use case.

In fact, I don’t see how you have this use case at all. Since your timer task has a high priority, it shouldn’t be falling behind, and it definitely shouldn’t be falling far enough behind to cause this issue.

Are you doing anything in your timer callback functions that can block or take a long time? Any combination of ISRs that can take a long time? You’re looking for something that can starve the timer task for two full timer periods.

Steve1 · April 5, 2021, 10:57pm

This issue occurs in one particular timer usage. This timers callback function is relatively short, although this can be deceiving, but there is no blocking involved. What I can say is that we are working on downgrading the microcontroller that this code runs within, so the case where this is occurring is code that works fine at a higher speed but crashes at a lower speed. This I’m sure will lead the thoughts that we are spending too much time in the callback, which I certainly cannot argue away. As far as ISR’s, we have a few, but not that many and generally short, most things are done via polling with delays. I also wanted to reiterate that this issue is hit only occasionally. We were seeing the crashes sometimes after 100’s of iterations without seeing the issue. My guess at this point is the issue is due to some timing that we are right on the edge of. Of course it easy to say we should re-structure our code but not always as easy to do. I have a work-around now, by tracking deletions and stops within the command queue processing of loop. My goal here was to gather more information from people more familiar with the inner workings of the timers, which I have certainly accomplished (thanks to your and Richard’s comments) and hopefully to help others or help the FreeRTOS baseline.

jefftenney · April 6, 2021, 12:41am

Understood. Keep in mind as you move ahead that all of the timer callbacks (not just the one for the timer giving you trouble) are at issue here. All timer callbacks should be short. And even more importantly, timer callbacks must never block.

One possible justification for all the labor of restructuring – you may already have timers falling behind schedule, even at your current higher clock speed. Seems that is the case anyway. That may be causing other issues worth fixing.

jefftenney · April 13, 2021, 6:27pm

FYI, the START_DONT_TRACE race condition is eliminated in PR 305.

kmcwithey · October 3, 2024, 7:28pm

I believe the START_DON’T_TRACE issue is not completely fixed.

In our case we are starting the timer when a button is pressed and stopping it when the button is released. We are using the edge interrupts of a GPIO to manage the rise a fall detection and calling the xTimerStopFromISR and xTimerStartFromISR functions in those handlers.

If the button is clicked fast enough, or other ISRs delay the timer, both edge ISRs happen before the start is processed in the timer. Then, when the start is processed it adds a START_DON’T_TRACE command to the queue after the stop command from the button release edge.

Would it make sense to iterate through the command queue and remove any START_DON’T_TRACE commands for the given timer when a stop command is processed by the timer?

Right now, when the stop command is processed, all it does is clear the tmrSTATUS_IS_ACTIVE flag which is set again when the START_DON’T_TRACE command is processed.

kmcwithey · October 3, 2024, 10:42pm

I forgot, the message queue is a FIFO and doesn’t have support to iterate through and remove a message in the middle. So that won’t work.

The following changes to timer.c fixed it for me:

-Move the line “pxTimer->ucStatus |= tmrSTATUS_IS_ACTIVE;” before the line “case tmrCOMMAND_START_DONT_TRACE:” so it doesn’t set it for case tmrCOMMAND_START_DONT_TRACE

-If prvInsertTimerInActiveLIst() fails(the next if statement), add an if statement to check if the tmrSTATUS_IS_ACTIVE flag is set before executing the code in that block.

If stop was queued prior to the start being processed it will clear that flag. In all other cases, that flag will be set when the tmrCOMMAND_START_DONT_TRACE command is posted.

I would upload my changes but I’m to new, the forum won’t let me upload files.

Here is the copy and pasted version:

            switch( xMessage.xMessageID )
            {
                case tmrCOMMAND_START:
                case tmrCOMMAND_START_FROM_ISR:
                case tmrCOMMAND_RESET:
                case tmrCOMMAND_RESET_FROM_ISR:
                    /* Start or restart a timer. */
                    pxTimer->ucStatus |= tmrSTATUS_IS_ACTIVE;
                case tmrCOMMAND_START_DONT_TRACE://ucStatus tmrSTATUS_IS_ACTIVE bit will always be set when this command is queued.  If we don't set it here we can use it to determine if a stop command was sent 
                    if( prvInsertTimerInActiveList( pxTimer, xMessage.u.xTimerParameters.xMessageValue + pxTimer->xTimerPeriodInTicks, xTimeNow, xMessage.u.xTimerParameters.xMessageValue ) != pdFALSE )
                    {
                        if(pxTimer->ucStatus & tmrSTATUS_IS_ACTIVE){
                            /* The timer expired before it was added to the active
                            * timer list.  Process it now. */
                            pxTimer->pxCallbackFunction( ( TimerHandle_t ) pxTimer );
                            traceTIMER_EXPIRED( pxTimer );

                            if( ( pxTimer->ucStatus & tmrSTATUS_IS_AUTORELOAD ) != 0 )
                            {
                                xResult = xTimerGenericCommand( pxTimer, tmrCOMMAND_START_DONT_TRACE, xMessage.u.xTimerParameters.xMessageValue + pxTimer->xTimerPeriodInTicks, NULL, tmrNO_DELAY );
                                configASSERT( xResult );
                                ( void ) xResult;
                            }
                            else
                            {
                                mtCOVERAGE_TEST_MARKER();
                            }
                        }
                    }
                    else
                    {
                        mtCOVERAGE_TEST_MARKER();
                    }

                    break;

Do you see any issues with this?

jefftenney · October 4, 2024, 2:20am

Which version of the kernel are you using? It seems you are using an old version, from before the time PR #305 was merged.

kmcwithey · October 4, 2024, 1:17pm

That is entirely possible. It is the version that SiLabs uses in gecko_sdk_4.4.3.

Looks like FreeRTOS Kernel V10.4.3