Watchdog timers using FreeRTOS

notAnumber · March 15, 2023, 12:05pm

Hello!

I have a program running on a microprocessor with 6 threads running simultaneously. All of them are created in the main file.

They all have their own “main-loop” which is a while(true){} statement.

Now I want to implement a watchdog timer which checks that no thread has become stuck. The watchdog timer can be cleared or kicked from the main-loop if everything is fine. But I want some kind of interrupt to be triggered if the timer hasn’t been kicked for a while.

Is this possible to do with FreeRTOS?

Best Regards

RAc · March 15, 2023, 12:23pm

Does your target support HW watchdogs?

Generally, yes, of course this is possible and has been done and discussed many times before. You may want to query the forum for watchdog.

richard-damon · March 15, 2023, 3:20pm

The biggest issue tends to be working out a good definition of “not stuck”.

To get an interrupt if not notified of being not-stuck for a given period is exactly what a hardware watchdog does, so if your processor supports it, use that.

If not, you can set up a timer that generates a high-priority interrupt (preferably one that isn’t blocked by critical sections) and triggers periodically. The handler checks for a flag that says “not stuck”: if it was set, it clears it and returns; if not, and enough time has passed, it does the “got stuck” action.
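A minimal sketch of that periodic check, written as plain C so the logic can be read without any port-specific code. All names here (wd_report_alive, wd_timer_tick, WD_MAX_MISSED_PERIODS) are hypothetical and not part of FreeRTOS; the tolerance of three periods is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; none of this is FreeRTOS API. */
#define WD_MAX_MISSED_PERIODS 3u   /* assumed tolerance before acting */

static volatile bool wd_not_stuck_flag;  /* set by the monitored code */
static uint32_t wd_missed_periods;       /* touched only by the timer handler */

/* Called by application code to report "not stuck". */
void wd_report_alive(void)
{
    wd_not_stuck_flag = true;
}

/* Body of the periodic high-priority timer interrupt.
 * Returns true when the "got stuck" action should be taken. */
bool wd_timer_tick(void)
{
    if (wd_not_stuck_flag) {
        wd_not_stuck_flag = false;   /* consume the flag */
        wd_missed_periods = 0;
        return false;
    }
    wd_missed_periods++;
    return wd_missed_periods >= WD_MAX_MISSED_PERIODS;
}
```

On a real target, wd_timer_tick() would be called from the timer ISR, and a true result would lead to whatever the "got stuck" action is (logging, reset, and so on).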

For setting the flag or kicking the watchdog timer itself, I tend to use a LOW priority task (or idle hook) that checks whether all tasks are making adequate progress and, if so, sets the flag/kicks the watchdog. Being low priority, it automatically detects if ANY higher priority task starts to starve the system, even if we aren’t monitoring it specifically.

PaulB-AWS · March 15, 2023, 5:26pm

Generally watchdog timers will trigger a system reset when the timer expires. It depends on the specific hardware implementation though.

Building on @richard-damon's suggestion… One way to accomplish this would be to:

Create an event group
Set a bit in the event group in each task when the execution in the task reaches either the beginning or end of its loop
In the idle task, periodically check the contents of the event group.
If all of the required bits are set, pet the watchdog timer and clear all bits in the event group. Otherwise, re-schedule the timer.
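The steps above can be sketched as follows. To keep the example self-contained, a C11 atomic bitmask stands in for the event group; in a FreeRTOS build, xEventGroupSetBits(), xEventGroupGetBits() and xEventGroupClearBits() would play the same roles. The function names and the task count of 6 (taken from the original question) are assumptions for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS      6u                        /* from the original question */
#define ALL_TASKS_MASK ((1u << NUM_TASKS) - 1u)  /* one bit per monitored task */

/* Stand-in for the event group (hypothetical, not FreeRTOS API). */
static atomic_uint checkin_bits;

/* Each task calls this once per pass through its main loop. */
void task_checkin(unsigned task_index)
{
    atomic_fetch_or(&checkin_bits, 1u << task_index);
}

/* Called periodically from the idle/low-priority context.
 * Returns true (and clears the bits) only if every task has checked in;
 * the caller pets the watchdog when this returns true. */
bool all_tasks_checked_in(void)
{
    if ((atomic_load(&checkin_bits) & ALL_TASKS_MASK) != ALL_TASKS_MASK) {
        return false;
    }
    atomic_fetch_and(&checkin_bits, ~ALL_TASKS_MASK);
    return true;
}
```

A task that checks in between the load and the clear loses that check-in and has to report again in the next window, which is harmless as long as the window is longer than the slowest task's loop period.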

richard-damon · March 16, 2023, 12:52am

Some processors do give the option of generating an interrupt (often non-maskable) when the watchdog is triggered.

I tend not to use the event group method, as tasks often have different timelines for being on schedule. So, I will often have the task set a global to the current time stamp every time it completes its process. The watchdog task can check whether the value is recent enough (but not in the future) for that task to determine if it is operating correctly. (And perhaps there are other slots filled in with verification codes.)
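A rough sketch of the time stamp approach, assuming a free-running 32-bit tick counter (in FreeRTOS the value would typically come from xTaskGetTickCount()). The struct and function names are hypothetical. Unsigned subtraction handles counter wrap-around, and a stamp that lies "in the future" shows up as a very large age and is rejected by the same comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-task monitoring slot (hypothetical layout). */
typedef struct {
    volatile uint32_t last_done_tick;  /* written by the monitored task */
    uint32_t max_age_ticks;            /* per-task allowance, assumed well below 2^31 */
} task_monitor_t;

/* The monitored task calls this each time it completes its process. */
void monitor_mark_done(task_monitor_t *m, uint32_t now)
{
    m->last_done_tick = now;
}

/* The watchdog task calls this for each slot.
 * Wrap-safe: (now - stamp) is computed modulo 2^32, so a stamp slightly
 * ahead of 'now' yields a huge age and fails the check. */
bool monitor_is_healthy(const task_monitor_t *m, uint32_t now)
{
    uint32_t age = now - m->last_done_tick;
    return age <= m->max_age_ticks;
}
```

Because each slot carries its own max_age_ticks, a slow task and a fast task can be supervised by the same watchdog task without sharing one common period.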

Til · March 16, 2023, 7:21am

We had the same question with SEGGER embOS pretty often (with basically the same answers) until I decided to add a simple API for watchdog support: embOS Real-Time Operating System User Guide & Reference Manual

Maybe something similar could be added to FreeRTOS.

RAc · March 16, 2023, 7:22am

Actually, it is impossible to design a robust (and certifiable) recovering architecture without hardware support, which ideally means an external timer/watchdog chip designed onto the hardware (from what I remember, even MCU internal HW watchdogs are not considered safe). Pure software solutions always suffer from the problem that the SW monitor threads themselves may starve due to a software problem and thus render the watchdog useless. A WD generating an interrupt is also considered insufficient, as the ISR itself may get stuck in an infinite loop and thus aggravate the original problem.

A typical watchdog architecture looks like this:

A dedicated thread/task that under regular circumstances is guaranteed to execute periodically kicks the HW watchdog. There is no golden answer to the question under what priority the WD kick task should execute (this has been discussed extensively, also on this forum). This highly depends on the application architecture and system behavior requirements.
The same task that kicks the HW is also in charge of servicing live monitors (SW watchdogs).
(along the lines of what Richard sketched): Any task that needs SW WD supervision can schedule future events, i.e. that “wd client task” informs the “wd kick task” that it (the client) expects to report its liveness back in x ticks in the future.
Once the client task has reached its deadline, it will cancel the pending event (possibly scheduling another event at the same time).
In between kicking the HW watchdog, the wd kick task scans the list of pending events.
If there is an expired event, the wd kick task will deliberately fail to kick the HW WD and thus initiate a WD reset (of course, possibly recording the source of the event failure).
Needless to say, the processing of the pending event list must be coded such that it always meets the deadline of the HW WD timeout period.
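The pending event list could look roughly like this. It is only a sketch under assumptions: a fixed table of 8 client slots, a 32-bit tick counter, and hypothetical function names. In a real system, access to the table would have to be protected (critical section, or clients posting to a queue owned by the kick task), which is left out here.

```c
#include <stdbool.h>
#include <stdint.h>

#define WD_MAX_CLIENTS 8u   /* assumed table size */

/* One pending "I will report back within 'timeout' ticks" event. */
typedef struct {
    bool     armed;
    uint32_t armed_at;   /* tick at which the client scheduled the event */
    uint32_t timeout;    /* ticks the client allowed itself */
} wd_event_t;

static wd_event_t wd_events[WD_MAX_CLIENTS];

/* Client: schedule (or re-schedule) its liveness deadline. */
void wd_client_expect(unsigned id, uint32_t now, uint32_t ticks)
{
    if (id >= WD_MAX_CLIENTS) {
        return;
    }
    wd_events[id].armed_at = now;
    wd_events[id].timeout  = ticks;
    wd_events[id].armed    = true;
}

/* Client: deadline met, cancel the pending event. */
void wd_client_cancel(unsigned id)
{
    if (id < WD_MAX_CLIENTS) {
        wd_events[id].armed = false;
    }
}

/* Kick task: index of the first expired event, or -1 if none. */
int wd_find_expired(uint32_t now)
{
    for (unsigned i = 0; i < WD_MAX_CLIENTS; i++) {
        if (wd_events[i].armed &&
            (uint32_t)(now - wd_events[i].armed_at) > wd_events[i].timeout) {
            return (int)i;
        }
    }
    return -1;
}

/* Kick task: true means "kick the HW watchdog this round";
 * false means deliberately skip the kick and let the HW WD reset. */
bool wd_kick_task_step(uint32_t now)
{
    return wd_find_expired(now) < 0;
}
```

The scan is bounded by WD_MAX_CLIENTS, so its worst-case run time is fixed and can be checked against the HW WD timeout period. The index returned by wd_find_expired() is what would be recorded as the source of the failure before the reset.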

Of course, this architecture (flexible as it is) has the shortcoming that for the SW WDs, it is the clients’ responsibility to report their expected deadlines to the kick task, so if a client fails to do that, the architecture will not recognize all improper timing scenarios.