Troubleshooting which task gets stuck

We have a device running 20+ tasks, and we’re investigating an issue where the device gets stuck after running for a few days. The challenge is that we don’t know which task is causing the issue.

Is there a way to determine which task is stuck when the problem occurs?

Currently, I’m tracking task execution with a per-task counter: each task increments its own counter every time it runs, and the main task prints the counters. This shows whether a task is still active, similar to a heartbeat check.
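The heartbeat idea can also be checked in software rather than by reading printed counts. Below is a minimal sketch, not a definitive implementation: `NUM_TASKS`, `heartbeat_kick` and `heartbeat_find_stalled` are hypothetical names, and it assumes at most 32 tasks (one bit each) and that each counter has a single writer.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TASKS 4u /* hypothetical; must be <= 32 for the bitmask below */

/* One counter per task; each task increments only its own slot. */
static volatile uint32_t g_heartbeat[NUM_TASKS];
/* Counter values seen by the monitor on its previous check. */
static uint32_t g_last_seen[NUM_TASKS];

/* Called by task `id` once per pass through its main loop. */
static void heartbeat_kick(uint32_t id)
{
    assert(id < NUM_TASKS);
    g_heartbeat[id]++;
}

/* Called periodically by the monitoring task.
 * Returns a bitmask with bit i set if task i has not advanced
 * since the previous call. */
static uint32_t heartbeat_find_stalled(void)
{
    uint32_t stalled = 0u;
    for (uint32_t i = 0u; i < NUM_TASKS; i++) {
        uint32_t now = g_heartbeat[i];
        if (now == g_last_seen[i]) {
            stalled |= (1u << i);
        }
        g_last_seen[i] = now;
    }
    return stalled;
}
```

The check period has to be longer than the longest time any task legitimately blocks, otherwise a healthy task that is simply waiting will be reported as stalled.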

Please let me know if there are other recommended ways to troubleshoot this issue.

Many thanks!

I have come across a pretty (I think) slick architecture at a customer’s site. It is basically a software watchdog system that works like this: a task that participates in the scheme periodically registers itself with a supervisor, telling the supervisor how long it (the client task) expects to be busy before calling back again. The supervisor sets itself a marker in the future. Because several tasks participate in the scheme, the supervisor always has a list of pending “I’ll be back by 8, if not, call the police” events. A task that has met its deadline cancels its pending expiration event and optionally sets up another one, possibly with a different expiration time.

If the supervisor detects that a deadline has expired, it knows which task failed to meet its deadline and can take appropriate action.
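A minimal sketch of such a supervisor's bookkeeping follows. It is not the customer's implementation: all names (`sw_wdg_arm`, `sw_wdg_cancel`, `sw_wdg_check`, `MAX_CLIENTS`) are hypothetical, time is passed in as a plain tick count (on target this could come from `xTaskGetTickCount()`), and locking is left out, so a real version would need a critical section or would route requests to the supervisor task through a queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CLIENTS 8u  /* hypothetical number of participating tasks */
#define NO_CLIENT  (-1)

typedef struct {
    bool     armed;     /* true while a deadline is pending */
    uint32_t deadline;  /* absolute tick by which the client must call back */
} sw_wdg_slot_t;

static sw_wdg_slot_t g_slots[MAX_CLIENTS];

/* Client: "I'll be back within budget_ticks". Re-arming replaces
 * any deadline that is still pending for this client. */
static void sw_wdg_arm(uint32_t id, uint32_t now, uint32_t budget_ticks)
{
    assert(id < MAX_CLIENTS);
    g_slots[id].deadline = now + budget_ticks;
    g_slots[id].armed = true;
}

/* Client: deadline met, drop the pending expiration event. */
static void sw_wdg_cancel(uint32_t id)
{
    assert(id < MAX_CLIENTS);
    g_slots[id].armed = false;
}

/* Supervisor: returns the id of the first client whose deadline has
 * passed, or NO_CLIENT. The signed difference keeps the comparison
 * correct when the tick counter wraps around. */
static int sw_wdg_check(uint32_t now)
{
    for (uint32_t i = 0u; i < MAX_CLIENTS; i++) {
        if (g_slots[i].armed && (int32_t)(now - g_slots[i].deadline) > 0) {
            return (int)i;
        }
    }
    return NO_CLIENT;
}
```

A client would call `sw_wdg_arm` before starting a unit of work and either `sw_wdg_cancel` or `sw_wdg_arm` again when it finishes; the supervisor calls `sw_wdg_check` periodically and logs or resets when it gets an id back.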

Have you verified whether it’s stuck because of a fault (for example, a HardFault) or because of unavailable or improperly accessed resources (for example, memory exhaustion or a deadlock)? If not, I would try attaching a debugger and keeping it running until the issue shows up. Once it’s stuck, you can pause execution to inspect where the code is stuck and, if you are using a FreeRTOS kernel-aware debugger, the status of the various tasks.

Another suggestion is to enable the malloc failed hook function and stack overflow checking, if they are not enabled already. You can define the corresponding hook functions to assert if they are ever called, which makes it easier to analyze two frequent reasons for crashes.
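As a rough sketch, the two hooks could record what happened before halting. This assumes `configUSE_MALLOC_FAILED_HOOK` is set to 1 and `configCHECK_FOR_STACK_OVERFLOW` to 1 or 2 in `FreeRTOSConfig.h`. The `TaskHandle_t` typedef is a stand-in so the sketch compiles off-target, `FAULT_HALT`, `record_fault` and `g_fault` are hypothetical, and the exact type of `pcTaskName` differs between kernel versions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in so the sketch compiles off-target. On a real build, remove
 * this typedef and include "FreeRTOS.h" and "task.h" instead. */
typedef void *TaskHandle_t;

/* Hypothetical halt macro. Off-target it does nothing; on target it
 * could be defined as configASSERT(0) or an infinite loop. */
#ifndef FAULT_HALT
#define FAULT_HALT() ((void)0)
#endif

typedef enum {
    FAULT_NONE = 0,
    FAULT_MALLOC_FAILED,
    FAULT_STACK_OVERFLOW
} fault_reason_t;

typedef struct {
    fault_reason_t reason;
    char           task_name[16];
} fault_record_t;

/* Inspect this in the debugger after the device has halted. */
static fault_record_t g_fault;

/* Store the reason and a bounded copy of the task name (if any). */
static void record_fault(fault_reason_t reason, const char *name)
{
    size_t i = 0;
    assert(reason != FAULT_NONE);
    g_fault.reason = reason;
    if (name != NULL) {
        for (; i + 1 < sizeof g_fault.task_name && name[i] != '\0'; i++) {
            g_fault.task_name[i] = name[i];
        }
    }
    g_fault.task_name[i] = '\0';
}

/* Called by the kernel when pvPortMalloc() fails. */
void vApplicationMallocFailedHook(void)
{
    record_fault(FAULT_MALLOC_FAILED, NULL);
    FAULT_HALT();
}

/* Called by the kernel when a stack overflow is detected. */
void vApplicationStackOverflowHook(TaskHandle_t xTask, char *pcTaskName)
{
    (void)xTask;
    record_fault(FAULT_STACK_OVERFLOW, pcTaskName);
    FAULT_HALT();
}
```

With a breakpoint on the halt, `g_fault` shows which of the two conditions fired and, for a stack overflow, the name of the offending task.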

Also, as explained here: Customization - FreeRTOS™, verify that FreeRTOS APIs are not called from interrupts that have a logical priority above the priority defined by configMAX_SYSCALL_INTERRUPT_PRIORITY.

There are a few things you might consider:

  1. A heartbeat LED designed into the hardware, so if the scheduler crashes, it stops blinking.
  2. A tag system (inserted with #ifdef and code) that shows the line number and file, plus whatever context data you want (calling task, etc.). That requires no hardware modification but does require an output channel (for instance, serial) to record the tags.
  3. Have each task reset its own external variable in its loop. The idle task increments and checks each task's variable. If one times out, an error has happened in that task.
  4. An error LED triggered by common error conditions (malloc failed, etc.).

Some of these need hardware design changes, some software changes.
I’ve used 1, 2, and 4.
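For option 2, one possible shape of a tag macro is sketched below. It is not the poster's code: instead of writing to a serial port it keeps the most recent tags in a RAM ring buffer that a debugger or a dump routine can read, and `TAG`, `TAGS_DISABLED`, `tag_record` and `tag_latest` are hypothetical names. It is also not thread-safe as written; on target, the update would need a critical section or a per-task buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_RING_SIZE 16u /* hypothetical; number of most recent tags kept */

typedef struct {
    const char *file;     /* __FILE__ of the tag site */
    uint32_t    line;     /* __LINE__ of the tag site */
    uint32_t    context;  /* caller-chosen data, e.g. a task id */
} tag_t;

static tag_t    g_tags[TAG_RING_SIZE];
static uint32_t g_tag_count; /* total tags recorded so far */

/* Store one tag, overwriting the oldest entry once the ring is full. */
static void tag_record(const char *file, uint32_t line, uint32_t context)
{
    tag_t *t = &g_tags[g_tag_count % TAG_RING_SIZE];
    t->file = file;
    t->line = line;
    t->context = context;
    g_tag_count++;
}

/* Compiles to nothing when TAGS_DISABLED is defined. */
#ifndef TAGS_DISABLED
#define TAG(ctx) tag_record(__FILE__, (uint32_t)__LINE__, (uint32_t)(ctx))
#else
#define TAG(ctx) ((void)0)
#endif

/* Most recent tag, or NULL if none has been recorded yet. */
static const tag_t *tag_latest(void)
{
    if (g_tag_count == 0u) {
        return NULL;
    }
    return &g_tags[(g_tag_count - 1u) % TAG_RING_SIZE];
}
```

Placing `TAG(task_id)` before and after each blocking call means that when the device hangs, the last few entries show which task was last seen and where.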