Timer START_DONT_TRACE Race Condition?

I’m working with an older version of FreeRTOS, v9.0.0, and have run across a problem similar to FreeRTOS Real Time Kernel (RTOS) / Bugs / #175 Timer to executes after deletion. In our case we have timers that get created for a short period of time that are auto-reloading and have short timeouts, 20 ms. We have been running into crashes as we exercise the system and they always link back to the prvProcessTimerOrBlockTask function, crashing specifically in the uxListRemove function. After some digging I found the cause to be the continued use of a deleted timer. Eventually my belief is the follow scenario is occurring:
a) While the timer task is servicing expired timers, a timer, say timer X, has had xTimerStop() and xTimerDelete() call on it, so these commands are added to the queue.
b) After the above two calls are made on timer X, the same timer is found to be expired, it’s callback routine is called.
c) Since the timer is auto-reloading, an attempt is made to add it back to the list, but due to the short timeout, it has expired again, so the START_DONT_TRACE command with timer X is added to the command queue.
d) When the command queue is service, timer X is stopped, deleted, and then the deleted timer pointer is put back in the timer list as an active timer.
e) At sometime after the delete is performed, the memory associated with the previous timer X is reused so the data within changes.
f) When the timer expires again, the system crashes de-referencing pointers that point to inappropriate places.
I have a quick fix for this which is to keep a local array (on the stack) of deleted timer pointers within the prvProcessReceivedCommands() function and anytime a START_DONT_TRACE command is received, check if the timer pointer matches a timer that was deleted in this execution of prvProcessReceivedCommands(). If a match is found, I break out of the START/RESET case so the stale timer pointer is not put back in the list. There is a possibility that the deleted timer memory would be reused immediately, thus I would be ignoring a valid timer, but I think this is a remote but possible case. The other thought I had was to add a magic number marker into the Timer_t structure and clear it during deletion, then this could be used to detect the case and avoid the crash. Again there are holes in this idea. The reason for this message is to let you know of the issue (it does not appear to be addressed in the latest timer.c code) and then ask for your suggestions on how to fix it. My quick fix is working and we are successfully avoiding the crashes we were seeing, but again, I’m not sure this is the best way to fix the issue, so was hoping for suggestions.
Steve.

How does configTIMER_TASK_PRIORITY compare to the priorities of the tasks that call the timer API functions?

#define configTIMER_TASK_PRIORITY ( 5 )

All other tasks in the system, except one, have priorities below 5. The one other task with a priority of 5 is a watchdog task which uses vTaskDelay for it waits, it does not use the timer APIs.

In that case, it would be unusual for the timer command queue to have more than one entry in use. Seems to me the scenario you describe requires multiple commands to be in the queue at the same time.

With your task priorities, you could end up with multiple commands in the queue if you call timer API functions from a timer callback function. Do you do that? Another way would be if you make timer API calls while the scheduler is disabled. Do you do that? EDIT: Another way is the FromISR() timer APIs. How about those?

We don’t disable the scheduler after it is started but we do cancel (stop then delete) timers in callback and on occasion start a different timer in a callback.

Sounds like in your case, another (maybe preferred) workaround would be not to stop/delete timers from a timer callback function. Instead, defer the stopping/deleting to some of your own task code.

The basic rule is you can’t delete a object (like at Timer) that has is still ‘in use’, i.e. has a reference. Issuing a Stop command in the callback delays the stop till the callback is done, so you can’t delete the timer then.

As Jeff says, the issue is deleting. Why not just stop it and then reuse it when you need it again rather than recreating it.

Thank you for your responses. We will take this into consideration as we decide how to move forward. Thanks again, Steve.

One further question, is there not a race condition here in the singular stop case as well? Consider the same scenario except without the deletion. If the xTimerStop() command is called within the timer callback on a short timeout auto-repeating timer, could not the START_DONT_TRACE command be queued behind the STOP in the command queue and then the timer end up being restarted when the intent was to stop it?

I would have to look at the code to see how it is handled, but at least in that case the Timer structure still exists so could record that the stop happened and the START_DONT_TRACE could see that the system changed and not restart. This get around the undefined behavior of using something that has been deleted, and moves you into just atomic race conditions.

Hi Steve - yes I agree the issue does appear to affect both stop and delete. Deferring those to one of your own task’s contexts (instead of from the timer task’s context) should be a solid workaround for you.

Hi Jeff, I understand your comment but it feels like a bit of a dodge to me. I can understand the “don’t delete a timer in it’s callback” comment, I’m with you there. The “don’t stop a timer in it’s callback” is a bit harder to understand. This feels much more like a timer bug. I suppose the alternative to calling it a bug is to call it out in the documentation as an unsupported behavior. I didn’t see it called out in the documentation, maybe I missed it. Both items would have been helpful to know when we started our project many moons ago. Just my two cents. Thanks, Steve.

Just for clarity, deleting a timer from its callback function is normally fine, even if it’s an auto-reload timer. Similarly, stopping an auto-reload timer from its callback function is also normally fine.

However, if the timer task falls far behind in its work, auto-reload timers may not stop/delete properly. It seems that use case hasn’t gotten much attention in the code or docs. But it’s a strange use case.

In fact, I don’t see how you have this use case at all. Since your timer task has a high priority, it shouldn’t be falling behind, and it definitely shouldn’t be falling far enough behind to cause this issue.

Are you doing anything in your timer callback functions that can block or take a long time? Any combination of ISRs that can take a long time? You’re looking for something that can starve the timer task for two full timer periods.

This issue occurs in one particular timer usage. This timers callback function is relatively short, although this can be deceiving, but there is no blocking involved. What I can say is that we are working on downgrading the microcontroller that this code runs within, so the case where this is occurring is code that works fine at a higher speed but crashes at a lower speed. This I’m sure will lead the thoughts that we are spending too much time in the callback, which I certainly cannot argue away. As far as ISR’s, we have a few, but not that many and generally short, most things are done via polling with delays. I also wanted to reiterate that this issue is hit only occasionally. We were seeing the crashes sometimes after 100’s of iterations without seeing the issue. My guess at this point is the issue is due to some timing that we are right on the edge of. Of course it easy to say we should re-structure our code but not always as easy to do. I have a work-around now, by tracking deletions and stops within the command queue processing of loop. My goal here was to gather more information from people more familiar with the inner workings of the timers, which I have certainly accomplished (thanks to your and Richard’s comments) and hopefully to help others or help the FreeRTOS baseline.

Understood. Keep in mind as you move ahead that all of the timer callbacks (not just the one for the timer giving you trouble) are at issue here. All timer callbacks should be short. And even more importantly, timer callbacks must never block.

One possible justification for all the labor of restructuring – you may already have timers falling behind schedule, even at your current higher clock speed. Seems that is the case anyway. That may be causing other issues worth fixing.

FYI, the START_DONT_TRACE race condition is eliminated in PR 305.