eTaskGetState from ISR version

KaDw · March 8, 2024, 8:49am

I’m working on a feature that dumps all task states when the watchdog timeout interrupt fires. It would be quite useful for post-mortem debugging. The problem is that while dumping the data I’m in ISR context, so I can’t use eTaskGetState.

Since the MCU is based on Cortex-M4, it can’t enter a critical section while it is in the ISR. Is there any workaround? I’m sure that the scheduler won’t have a chance to run and the whole device will restart after dumping the data.

rtel · March 8, 2024, 7:40pm

You can use taskENTER_CRITICAL_FROM_ISR() inside ISR functions, but I’m wondering whether that is necessary if you don’t intend to return from the interrupt anyway because the watchdog is resetting the processor.
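A minimal sketch of the ISR-safe critical section pattern mentioned above. The handler name `vWatchdogISR` and the helper `vDumpTaskStates` are hypothetical, and the two macros are stubbed so the fragment builds on a host; on target they come from `task.h` and the stubs would be removed.

```c
#include <stdint.h>

/* Host stand-ins (assumption): on target these come from FreeRTOS.h and task.h. */
typedef uint32_t UBaseType_t;
UBaseType_t uxMockInterruptMask = 0;          /* stands in for the port's interrupt mask */

UBaseType_t uxMockEnterCriticalFromISR( void )
{
    UBaseType_t uxPrevious = uxMockInterruptMask;

    uxMockInterruptMask = 1;                  /* interrupts masked */
    return uxPrevious;
}
#define taskENTER_CRITICAL_FROM_ISR()      uxMockEnterCriticalFromISR()
#define taskEXIT_CRITICAL_FROM_ISR( x )    ( uxMockInterruptMask = ( x ) )

/* Hypothetical dump routine; here it only records whether it ran masked. */
int iDumpCalls = 0;
UBaseType_t uxMaskSeenDuringDump = 0;

void vDumpTaskStates( void )
{
    iDumpCalls++;
    uxMaskSeenDuringDump = uxMockInterruptMask;
}

/* Hypothetical watchdog handler: save the mask on entry, restore it on exit. */
void vWatchdogISR( void )
{
    UBaseType_t uxSavedInterruptStatus = taskENTER_CRITICAL_FROM_ISR();

    vDumpTaskStates();

    taskEXIT_CRITICAL_FROM_ISR( uxSavedInterruptStatus );
}
```

Unlike taskENTER_CRITICAL(), the ISR variant returns the previous mask state, which has to be passed back to the matching exit macro.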

KaDw · March 10, 2024, 11:13am

@rtel Thanks. Is there any chance of adding an eTaskGetStateFromISR() to the FreeRTOS API that just uses taskENTER_CRITICAL_FROM_ISR()?

If I want the change now, I’m wondering whether there is a better way than adding eTaskGetStateFromISR(), with the change you mentioned, to the tasks.c file. From what I understand it has to live somewhere in the kernel, because the function accesses members of the TCB structure (pxTCB), which is hidden from the user. Is there a better way to do it?

RAc · March 10, 2024, 11:57am

Hi Karol,

I am afraid that the use case you are looking at is so rare and isolated that it does not justify adding extra functionality and OS code to it. Also, a good number of WD Timeouts follow scenarios in which the feature you envision does not help (for example, starvation after priority inheritance issues).

In my experience, the following two strategies have proven most useful for watchdog timeout analyses:

1. Running the app under control of Tracealyzer or a compatible tool.
2. Recording the value of pxCurrentTCB at WD timeout. This will at least tell you which task owned the CPU at timeout time, and you can analyze the TCB at that time. If you have a FreeRTOS-aware IDE, you don’t even need to do that manually.

Neither the latter strategy nor yours, btw, will be of use, however, if your WD failed to be retriggered in time due to thrashing ISRs. For that kind of thing, a kernel analysis as provided by Tracealyzer will be the only meaningful and also least invasive technique to determine what happened.
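The strategy of recording the current TCB at WD timeout could look roughly like the sketch below. It goes through the public xTaskGetCurrentTaskHandle() and pcTaskGetName() calls instead of touching the kernel variable directly; neither is a documented FromISR function, so whether they are acceptable in your WD handler is an assumption to check against your kernel version. The record layout, the magic value and `vRecordCurrentTask` are hypothetical, and the FreeRTOS calls are stubbed so the fragment builds on a host.

```c
#include <stdint.h>
#include <string.h>

/* Host stand-ins (assumption): on target these come from FreeRTOS.h and task.h. */
#define configMAX_TASK_NAME_LEN    16
typedef void * TaskHandle_t;
char cMockTaskName[] = "net_rx";              /* hypothetical task name */
TaskHandle_t xTaskGetCurrentTaskHandle( void ) { return ( TaskHandle_t ) cMockTaskName; }
char * pcTaskGetName( TaskHandle_t xTask )     { return ( char * ) xTask; }

#define WD_RECORD_MAGIC    0x57444F47UL       /* ASCII "WDOG" */

typedef struct
{
    uint32_t ulMagic;                         /* marks the record as valid */
    TaskHandle_t xTask;                       /* task that owned the CPU */
    char cName[ configMAX_TASK_NAME_LEN ];    /* copy of its name */
} WatchdogRecord_t;

/* On target this would sit in RAM that survives the reset (linker-specific, not shown). */
WatchdogRecord_t xWatchdogRecord;

void vRecordCurrentTask( void )
{
    TaskHandle_t xTask = xTaskGetCurrentTaskHandle();

    xWatchdogRecord.xTask = xTask;
    strncpy( xWatchdogRecord.cName, pcTaskGetName( xTask ), configMAX_TASK_NAME_LEN - 1 );
    xWatchdogRecord.cName[ configMAX_TASK_NAME_LEN - 1 ] = '\0';

    /* Written last so a half-filled record is not mistaken for a valid one. */
    xWatchdogRecord.ulMagic = WD_RECORD_MAGIC;
}
```

After the reset, startup code can check the magic value, report the record, and clear it.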

Edit: The other issue, of course, is that you need access to the list of task handles before calling this API. Note that passing in NULL at WD ISR time is not a good idea: the WD may have kicked in while an ISR was executing, in which case there may be no valid task handle available. So more work would need to be done aside from providing an ISR “safe” version of the API, which, as Richard pointed out, isn’t even needed.

KaDw · March 11, 2024, 8:30am

The problem with WD timeouts is that they happen rarely and sometimes it takes days to reproduce them. It is especially hard if you have devices in the field and can’t access them in any way. The only option you often have is post-mortem analysis.

I’m familiar with Tracealyzer. It is good for some issues, but on the other hand I have had multiple cases where slowing down the whole MCU because of Tracealyzer logging just hid the problem.

I agree that the second strategy of recording the current TCB is often not enough: as you mentioned, if you have a periodic timer running or a periodic task waking up, it will mask the problem. Every strategy has its shortcomings, but combined they may show you something useful.

In my opinion, the ultimate solution for all these problems is a core dump analyzed offline with GDB. Zephyr has such a feature.

RAc · March 11, 2024, 8:35am

Hmmm, how would a core dump help in this scenario? Don’t you need the history that led to the timeout? A core dump would only give you a frozen system state, right? In my experience, a deep trace is the only way to forensically figure out what happened.

aggarg · March 12, 2024, 5:47am

Assuming that you have read all the pitfalls pointed out by @RAc and you still want to do it, you can use FREERTOS_MODULE_TEST (see tasks.c on the main branch of the FreeRTOS/FreeRTOS-Kernel repository on GitHub). You need to do the following:

Define FREERTOS_MODULE_TEST in your FreeRTOSConfig.h:
```
#define FREERTOS_MODULE_TEST  1
```
Create a file named tasks_test_access_functions.h at a location which is in your include search paths.
Add your definition of eTaskGetStateFromISR in the file tasks_test_access_functions.h.
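As a rough illustration of the last step, the sketch below shows what an eTaskGetStateFromISR() definition might look like. It is loosely modelled on how eTaskGetState() classifies a task by the list its TCB sits on, with the critical section swapped for the FromISR pair; the exact kernel logic differs between versions (for example, the task-notification check for suspended tasks is omitted here), so the body should be derived from your own tasks.c rather than copied from this sketch. The kernel types and lists are replaced by simplified host stand-ins so that the fragment builds outside the kernel, and the NULL guard is an addition of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* ---- Host stand-ins (assumption): simplified shapes of kernel types and data ---- */
typedef uint32_t UBaseType_t;
typedef enum { eRunning = 0, eReady, eBlocked, eSuspended, eDeleted, eInvalid } eTaskState;
typedef struct { int iUnused; } List_t;
typedef struct { const List_t * pxContainer; } ListItem_t;
typedef struct { ListItem_t xStateListItem; ListItem_t xEventListItem; } TCB_t;
typedef TCB_t * TaskHandle_t;
#define listLIST_ITEM_CONTAINER( pxItem )    ( ( pxItem )->pxContainer )
#define taskENTER_CRITICAL_FROM_ISR()        ( ( UBaseType_t ) 0 )
#define taskEXIT_CRITICAL_FROM_ISR( x )      ( ( void ) ( x ) )

List_t xDelayedTaskList1, xDelayedTaskList2, xPendingReadyList;
List_t xSuspendedTaskList, xTasksWaitingTermination;
List_t * pxDelayedTaskList = &xDelayedTaskList1;
List_t * pxOverflowDelayedTaskList = &xDelayedTaskList2;
TCB_t * pxCurrentTCB = NULL;

/* ---- The part that would go into tasks_test_access_functions.h ---- */
eTaskState eTaskGetStateFromISR( TaskHandle_t xTask )
{
    const TCB_t * const pxTCB = xTask;
    const List_t * pxStateList;
    const List_t * pxEventList;
    const List_t * pxDelayedList;
    const List_t * pxOverflowedDelayedList;
    UBaseType_t uxSavedInterruptStatus;
    eTaskState eReturn;

    if( pxTCB == NULL )
    {
        return eInvalid;                     /* guard added by this sketch */
    }

    if( pxTCB == pxCurrentTCB )
    {
        return eRunning;
    }

    /* Snapshot the list membership with interrupts masked. */
    uxSavedInterruptStatus = taskENTER_CRITICAL_FROM_ISR();
    {
        pxStateList = listLIST_ITEM_CONTAINER( &( pxTCB->xStateListItem ) );
        pxEventList = listLIST_ITEM_CONTAINER( &( pxTCB->xEventListItem ) );
        pxDelayedList = pxDelayedTaskList;
        pxOverflowedDelayedList = pxOverflowDelayedTaskList;
    }
    taskEXIT_CRITICAL_FROM_ISR( uxSavedInterruptStatus );

    if( pxEventList == &xPendingReadyList )
    {
        eReturn = eReady;                    /* readied while the scheduler was suspended */
    }
    else if( ( pxStateList == pxDelayedList ) || ( pxStateList == pxOverflowedDelayedList ) )
    {
        eReturn = eBlocked;
    }
    else if( pxStateList == &xSuspendedTaskList )
    {
        /* On the suspended list but waiting on an event means blocked indefinitely. */
        eReturn = ( pxEventList == NULL ) ? eSuspended : eBlocked;
    }
    else if( ( pxStateList == &xTasksWaitingTermination ) || ( pxStateList == NULL ) )
    {
        eReturn = eDeleted;
    }
    else
    {
        eReturn = eReady;                    /* any other list is taken to be a ready list */
    }

    return eReturn;
}
```

Because the header is included from inside tasks.c, a function defined there can see the file-scope kernel data that is otherwise hidden from application code; the prototype still has to be declared somewhere the WD handler can include.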

KaDw · March 12, 2024, 8:39am

@RAc I see your point, the history seems nice. I would have to log only essential data, as the Tracealyzer dump grows quickly. The core dump that I mentioned saves the whole RAM, so you can examine every task that exists at a given moment in time.

@aggarg
Thank you! That is exactly what I was looking for.

RAc · March 12, 2024, 8:42am

Yes, but only at the time of the dump. This is my point. The chain of events that caused the WD to time out has been lost in history at that point. From my understanding, you cannot recover that history from the core dump; you must record it. Or am I missing something?

KaDw · March 12, 2024, 10:47am

Assuming that you reset the watchdog in the lowest-priority task and a higher-priority task is preventing you from doing so, you can get enough information to trace it just from the core dump. It turns out this is the case I have to deal with right now.

Of course there are cases where history is essential to trace what happened.

RAc · March 12, 2024, 11:02am

Could you outline how you do that? In my understanding, the high-priority task must have starved the watchdog retrigger task for the configured amount of time for the WD to reset. The only case in which a core dump could help you there would be if you can positively deduce that the high-priority task is stuck in an endless loop by mapping its PC to a code location and the other registers to automatic variables. However, that is only one possible cause for the WD timeout, as I pointed out before. So a core dump is certainly better than nothing at all but may in many cases not be sufficient.
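For the code that the WD interrupt itself preempted, the PC mentioned above can be read from the exception frame that a Cortex-M core pushes on interrupt entry (R0 to R3, R12, LR, PC, xPSR, lowest address first). The sketch below assumes the handler has already obtained a pointer to that frame, which normally takes a small assembly shim that selects MSP or PSP and is not shown here; `InterruptedContext_t` and `xCaptureInterruptedContext` are hypothetical names. For tasks that were not running, the PC sits in the context saved on the task stack, whose layout is port-specific and not covered by this sketch.

```c
#include <stdint.h>

/* Word offsets in the basic Cortex-M exception frame, lowest address first. */
enum
{
    FRAME_R0 = 0, FRAME_R1, FRAME_R2, FRAME_R3,
    FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR
};

/* Hypothetical record kept for post-mortem analysis. */
typedef struct
{
    uint32_t ulPC;    /* address of the interrupted instruction */
    uint32_t ulLR;    /* return address of the interrupted function */
} InterruptedContext_t;

/* pulStackedFrame: pointer to the frame pushed by hardware on exception entry. */
InterruptedContext_t xCaptureInterruptedContext( const uint32_t * pulStackedFrame )
{
    InterruptedContext_t xContext;

    xContext.ulPC = pulStackedFrame[ FRAME_PC ];
    xContext.ulLR = pulStackedFrame[ FRAME_LR ];
    return xContext;
}
```

The recorded PC can then be mapped to a source line offline, for example with addr2line from the toolchain and the ELF file of the exact firmware build.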

KaDw · March 12, 2024, 12:01pm

It is done exactly as you described.