Software Watchdog Design

carlk3 · February 4, 2021, 5:54pm

I’m looking at implementing a watchdog to detect hangs in any of several tasks. I’m thinking along these lines:

Have a watchdog counter for each task
Each task increments a counter at strategic places within the code
A Software Timer (or Tick Hook?) periodically checks the counters to make sure that no task is hung up

In case a hang (of one or more tasks) is detected, the watchdog

Captures a fault record (say, in flash)
Resets the system

For the fault record, I’d like to be able to capture, say, the Program Counter (PC) and Link Register (LR) of the hung task. The watchdog knows which task is hung. Can it go into the TasksLists and find the stack frame? [Follow the TaskHandle to pxTopOfStack?] Is there any example code around?

rtel · February 4, 2021, 6:18pm

This is a cut and paste of a reply I hurriedly gave to a similar question (in a different channel) recently. Probably something we should write a blog about:

I’m going to give my standard, annoying engineer’s answer and say “it depends on the use case”, but in general the watchdog is likely going to want to check:

1. All important tasks are still executing and getting enough CPU time.
2. The scheduler is still executing.
3. The Idle task is getting some CPU time too, showing that none of the higher priority application tasks are stuck in any loops.
4. Interrupts are still executing.

One way to do that is to have all of the above keep an execution (or cycle) count. This can simply be an incrementing volatile unsigned int. For example, if you have a task that must run at least once every 50ms, then when that task executes have it programmatically check that no more than 50ms has passed since it last executed (it can do that manually by reading the tick count). If the time is breached, latch an error in a variable; otherwise increment that task’s execution/cycle counter.

The watchdog kicking is then performed from a monitor task that runs periodically (or an interrupt that runs periodically). The monitor task only kicks the watchdog if no errors have been latched (everything is healthy) and all the execution/cycle counters are still incrementing (all tasks are still running - so not stuck in an infinite loop, deadlocked, or starved of execution time). If those conditions aren’t met the watchdog isn’t kicked and times out - resetting the system (or the task can reset manually if it has that capability).
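A minimal sketch of the counter-plus-latched-error scheme described above, written as portable C so the logic can be checked off-target. All names here (health_init, task_checkin, monitor_all_healthy, NUM_MONITORED) are hypothetical and not part of FreeRTOS; on a real system now_ticks would come from xTaskGetTickCount(), and the monitor would kick the hardware watchdog only when monitor_all_healthy() returns true.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MONITORED 3u /* hypothetical number of monitored tasks */

typedef struct {
    volatile uint32_t cycles;   /* incremented on every on-time pass */
    uint32_t last_seen_cycles;  /* snapshot taken by the monitor */
    uint32_t last_run_tick;     /* tick count at the previous check-in */
    uint32_t max_period_ticks;  /* longest allowed gap between check-ins */
    bool started;               /* false until the first check-in */
} task_health_t;

static task_health_t health[NUM_MONITORED];
static volatile bool error_latched = false;

void health_init(unsigned id, uint32_t max_period_ticks)
{
    health[id].max_period_ticks = max_period_ticks;
}

/* Called by each monitored task at its strategic points. */
void task_checkin(unsigned id, uint32_t now_ticks)
{
    task_health_t *h = &health[id];

    /* Unsigned subtraction stays correct across tick-counter wraparound. */
    if (h->started &&
        (uint32_t)(now_ticks - h->last_run_tick) > h->max_period_ticks) {
        error_latched = true;   /* timing breached: latch, never clear */
    } else {
        h->cycles++;
    }
    h->started = true;
    h->last_run_tick = now_ticks;
}

/* Called periodically by the monitor; true means "safe to kick". */
bool monitor_all_healthy(void)
{
    bool ok = !error_latched;

    for (unsigned i = 0; i < NUM_MONITORED; i++) {
        if (health[i].cycles == health[i].last_seen_cycles) {
            ok = false;         /* counter did not advance since last check */
        }
        health[i].last_seen_cycles = health[i].cycles;
    }
    return ok;
}
```

The monitor period has to be longer than the slowest task's check-in period, otherwise a healthy task can look stalled simply because it has not been due to run yet.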

There are other ways too - for example you can have tasks set a bit in an event group then periodically check all the expected bits are set in the event group before clearing all the bits again - that shows the tasks expected to run did run (because they set their bit) but doesn’t say how often they ran and so doesn’t guard against tasks having invalid timing, or getting stuck after setting their bit. You then also need to decide what to do when the system reboots - some systems will tell you the reason for last reset - if the last reset was a watchdog timeout then you might want to take an action.
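The event-bit alternative mentioned above could look roughly like the sketch below. It uses a plain bitmask so that it runs anywhere; the names (task_report_alive, monitor_check_and_clear, TASK_A_BIT and so on) are hypothetical. In a FreeRTOS application the bits would more likely live in an event group and be set with xEventGroupSetBits(), since the read-modify-write on a plain variable shown here is not atomic.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per monitored task (hypothetical assignment). */
#define TASK_A_BIT    (1u << 0)
#define TASK_B_BIT    (1u << 1)
#define TASK_C_BIT    (1u << 2)
#define ALL_TASK_BITS (TASK_A_BIT | TASK_B_BIT | TASK_C_BIT)

static volatile uint32_t alive_bits;

/* Each task calls this with its own bit when it has done useful work. */
void task_report_alive(uint32_t bit)
{
    alive_bits |= bit;
}

/* The monitor calls this periodically: returns true only if every
 * expected bit was set since the previous call, then clears them all. */
bool monitor_check_and_clear(void)
{
    bool all_ran = (alive_bits & ALL_TASK_BITS) == ALL_TASK_BITS;

    alive_bits = 0u;
    return all_ran;
}
```

As the post says, this only shows that each task ran at least once per monitor period; it says nothing about how often, so it is weaker than the per-task timing check.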

stjepan_skrnjug · February 4, 2021, 6:33pm

Hello,
For the purpose of implementing a watchdog in a multi-tasking project, here is some useful information for better understanding:

Hope that helps.

carlk3 · February 5, 2021, 5:02pm

Would a watchdog implemented as a Software Timer callback be able to run if one of the monitored tasks is in a tight loop? I currently have

#define configTIMER_TASK_PRIORITY (configMAX_PRIORITIES - 2)

Also,

#define configUSE_PREEMPTION 1

and

#define configUSE_TIME_SLICING 1

carlk3 · February 5, 2021, 5:07pm

If a watchdog (in another task) has a TaskHandle to a task that he suspects is hung, how can he get a stack frame for that task? pxTopOfStack? Does that point to the current stack frame or the next available stack frame? If the latter, how can he back up to the current stack frame? This project (on Cortex M4F) does use the floating point unit. Are there any code samples around?

richard-damon · February 5, 2021, 5:30pm

The exact meaning of pxTopOfStack is somewhat port dependent, but will tend to be the value of the stack pointer at the point when the task was last interrupted and switched out, after saving all the context. You can look at the port code to see how a task is switched out and state saved to see what the stack structure looks like.

Note: if you are checking from within an interrupt, the TCB for the currently running task will not be up to date; its stack pointer would have to be obtained from the ISR stack frame instead.

richard-damon · February 5, 2021, 5:48pm

As long as the task in the tight loop doesn’t have a higher priority than the timer task, the timer callback should run. Since you have the timer task at configMAX_PRIORITIES-2, a task at configMAX_PRIORITIES-1 could block it by running.

Also, if another timer callback hangs, then your watchdog callback wouldn’t get a chance to run.

carlk3 · February 5, 2021, 6:52pm

If I am checking from a Software Timer callback I should be running in the context of the RTOS daemon (or ‘timer service’) task, so the TCB for the task I want to examine should be up to date, right?

carlk3 · February 5, 2021, 7:03pm

Makes sense. configMAX_PRIORITIES is really the number of priorities, right? So configMAX_PRIORITIES-1 is as high as it can get? I do have one real-time task set to configMAX_PRIORITIES-1, so I guess I would need to adjust that… which means I will have to review my other Software Timers for any impacts to the real-time task.

richard-damon · February 5, 2021, 8:20pm

The fact that your top priority task could block this watchdog is not necessarily a reason to move the timer task up. One of the reasons the timer task has a programmable priority is exactly this kind of situation: if the timer task could stop that task from meeting its requirements, that could be a very good reason to let that task stay higher. I suppose part of the question is how likely that task is to actually run indefinitely, starving the rest of the system. Such top-priority tasks tend to be fairly simple and easy to verify against such problems.

As to your other question, yes: if your code is running in a task, then the TCBs of all other tasks will be up to date.

carlk3 · February 5, 2021, 8:45pm

As an experiment, I attached a debugger, set a breakpoint in one of my tasks, and then looked at the memory pointed to by the TaskHandle of another task:

0x08032ff8	080329e4	032c22c4	08039744	0802ce04	08032ff8	0802cdfc	00000010	00000000	.)…",.D…/…
0x08033018	00000000	08032ff8	00000000	00000004	080324a8	54626956	006b7361	00000000	…/…$…VibTask…
0x08033038	00000000	00000008	00000000	00000004	00000000	00000000	00000002	08040c80	…
0x08033058	8302001f	00000000	000b5bdb	00000000	00000000	08033354	080333bc	08033424	…[…T3…3…$4…
0x08033078	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	…

Comparing it to the stack frame diagram in the Cortex-M reference manual, and to a second stack frame diagram (images not included here):

I can’t make heads nor tails of it. It looks to me like:

R0 = 0x08032ff8
R1 = 080329e4
R3 = 032c22c4
R12 = 08039744
LR = 0802ce04
PC = 08032ff8
PSR = 0802cdfc

I was hoping to find something useful in PC and LR that would help me locate a hang. On this system, the code is in flash which is 0x10080000 - 0x10100000. PC = 0x08032ff8 and LR = 0x0802ce04 would be pointing into RAM.

carlk3 · February 5, 2021, 9:27pm

Oh, wait, I need to follow one more level of indirection. The TaskHandle has value 0x08032FF8, but that’s pointing to the TCB, which is what I showed above. So, the first field of the TCB is pxTopOfStack, or 0x080329e4. If I look there:

0x080329e4	00000001	08032ac8	00000000	0802d188	a5a5a5a5	a5a5a5a5	a5a5a5a5	a5a5a5a5
0x08032a04	ffffffed	3e800000	469c4000	ffdedbfd	feff77f3	bbfff7bf	37ff95f7	fec75fff
0x08032a24	facfffff	fe5fffff	ffffdf76	ffff97fe	ffde5ff3	bedfdff7	fefdf5f7	fed7f5ff
0x08032a44	beffdbff	00000000	08032ffc	10000000	e000e000	beb73e00	100917d5	1009189a
0x08032a64	61000000	ffddfbff	dffdf7e7	fd5ff7ee	fadf5577	bf800000	3e178897	3e1cd04f

so

R0 = 00000001
R1 = 08032ac8
R2 = 00000000
R3 = 0802d188
R12 = a5a5a5a5
LR = a5a5a5a5
PC = a5a5a5a5
PSR = a5a5a5a5

Still not helpful.
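The extra level of indirection noted at the start of this post can be sketched as follows. This assumes only that pxTopOfStack is the first member of the TCB, which is what the dump above shows. The types mock_tcb_t and mock_task_handle_t and the function top_of_stack_from_handle are hypothetical stand-ins, because the real TCB layout is private to the kernel's tasks.c and differs between configurations.

```c
#include <stdint.h>

/* Stand-in for the start of a TCB: only the first field matters here. */
typedef struct {
    volatile uint32_t *pxTopOfStack; /* saved stack pointer of the task */
    /* ... remaining TCB fields omitted ... */
} mock_tcb_t;

/* A task handle is an opaque pointer to the task's TCB. */
typedef void *mock_task_handle_t;

/* Follow the handle to the TCB, then read the saved stack pointer. */
const volatile uint32_t *top_of_stack_from_handle(mock_task_handle_t handle)
{
    const mock_tcb_t *tcb = (const mock_tcb_t *)handle;

    return tcb->pxTopOfStack;
}
```

For the task examined here, the handle value is 0x08032FF8 and the word stored at that address, 0x080329e4, is the saved stack pointer.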

carlk3 · February 5, 2021, 9:37pm

I think I’m getting warmer: if I add 0x64 to the topOfStack, for (one less than?) the size of an 8-byte aligned Extended Stack Frame, I get to

0x08032a48	00000000	08032ffc	10000000	e000e000	beb73e00	100917d5	1009189a	61000000
0x08032a68	ffddfbff	dffdf7e7	fd5ff7ee	fadf5577	bf800000	3e178897	3e1cd04f	3c96f501
0x08032a88	3e3a3326	3e638e2a	3e924926	39b2080f	3f317180	3f15f6d9	3f800000	3ed79000
0x08032aa8	20000010	10091885	00000008	00000001	0802d230	100941d1	100df868	00000002

so:

R0 = 00000000
R1 = 08032ffc
R2 = 10000000
R3 = e000e000
R12 = beb73e00
LR = 100917d5
PC = 1009189a
PSR = 61000000

At PC = 1009189a appears:

0x10091882 b.n	10091860 <xTaskDelayUntil+0x7c>
0x10091884 .word	0x0802cdf0
0x10091888 .word	0x0802ce74
0x1009188C .word	0x100b5d1c
0x10091890 .word	0x100b6094
0x10091894 .word	0x100b5bc0
0x10091898 .word	0x100b5d30
0x1009189C .word	0x100b5c78

and xTaskDelayUntil is exactly what I expect that task to be doing!

At LR = 100917d5 appears

0x100917B0 bl	10092eac <vPortExitCritical>
2317: }
0x100917B4 mov	r0, r4
0x100917B6 pop	{r3, r4, r5, pc}
0x100917B8 .word	0x0802cdf0
0x100917BC .word	0x0802cde8
0x100917C0 .word	0x100b5d04
0x100917C4 .word	0x100b6054
0x100917C8 .word	0x100b5bc0
0x100917CC .word	0x0802ce34
0x100917D0 .word	0x0802cdf8
0x100917D4 .word	0x0802cc50
0x100917D8 .word	0x0802cc44
0x100917DC .word	0x0802ce78
0x100917E0 .word	0x0802ce30

I don’t know what to make of that. But this is at the top level of a task, so maybe there’s no real calling function.

Remaining questions:

So it looks like pxTopOfStack points to the next available frame rather than the last saved frame?
Is the Reserved word at offset 0x64 in an Extended Frame saved?
How do I know whether the TopOfStack frame is basic or extended?
How do I know whether the TopOfStack frame is 4-byte aligned or 8-byte aligned?

richard-damon · February 5, 2021, 9:52pm

Look in the code for the port you are using (typically a port.c). There will be the code that saves the register set to prepare to switch out the task (for the M4, see the function xPortPendSVHandler). Looking at that code, and the method used to call the scheduler (which for the M4 is a software pended interrupt) should let you trace back what should be on the stack.

carlk3 · February 6, 2021, 5:48pm

OK, I have studied and stepped through xPortPendSVHandler and I think I now understand what I am seeing.

The key instructions are:

tst r14, #0x10 /* Is the task using the FPU context?  If so, push high vfp registers. */
vstmdbeq r0!, {s16-s31} // 16 words
// VSTMDB: floating-point Store Multiple, Decrement Before
// EQ suffix: executes only when the Z flag is set (Z = 1, "equal"), i.e. when the tst found bit 4 of r14 clear

and

stmdb	r0!, {r4, r5, r6, r7, r8, r9, sl, fp, lr} // 9 words
// "STore Multiple Decrement Before" pushes all the callee-saved core registers onto psp

So, I have 16 + 9 = 25 words pushed onto the stack, which is 100 (0x64) bytes. So that accounts for the 0x64 that I need to add to the pxTopOfStack to get to the R0, R1, R2, R3, R12, LR, PC, and PSR that I’m interested in. Note: I think the frame diagrams I showed above don’t apply here.

With that understanding, in general I need to determine whether or not the high vfp registers are there, to know whether I need to add 9 or 25 words (0x24 or 0x64 bytes) to pxTopOfStack. I should be able to do this by looking at the lr saved at word offset 8 (byte offset 0x20) from pxTopOfStack. This should contain an EXC_RETURN value, and bit 4 (mask 0x10, as in tst r14, #0x10) will tell me whether or not there are floating-point registers: the bit is clear when the FPU context was saved. In the dump above, that word is 0xffffffed, which has bit 4 clear, so the extra 16 words are present.
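The decoding rule worked out above can be sketched in C as follows. This is an illustration under the assumptions in this thread (the Cortex-M4F port's PendSV handler saving r4-r11 plus lr, and s16-s31 only when the FPU context is active), not code taken from FreeRTOS. The names hw_frame_t, saved_context_has_fpu and decode_saved_frame are hypothetical, and sample_stack holds the words from the memory dump at 0x080329e4 shown earlier.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXC_RETURN_WORD_INDEX  8u        /* r4-r11 are words 0..7, lr is word 8 */
#define EXC_RETURN_BASIC_FRAME (1u << 4) /* bit 4 set: no FPU context stacked */
#define SW_SAVED_CORE_WORDS    9u        /* r4-r11 and lr */
#define SW_SAVED_FPU_WORDS     16u       /* s16-s31 */

/* Registers stacked by the hardware on exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} hw_frame_t;

/* True when the saved EXC_RETURN says the FPU context was in use. */
bool saved_context_has_fpu(const uint32_t *top_of_stack)
{
    return (top_of_stack[EXC_RETURN_WORD_INDEX] & EXC_RETURN_BASIC_FRAME) == 0u;
}

/* Skip the software-saved registers and read the hardware-stacked frame.
 * Only valid for a task that is currently switched out. */
hw_frame_t decode_saved_frame(const uint32_t *top_of_stack)
{
    uint32_t skip = SW_SAVED_CORE_WORDS;

    if (saved_context_has_fpu(top_of_stack)) {
        skip += SW_SAVED_FPU_WORDS;
    }
    const uint32_t *f = top_of_stack + skip;
    hw_frame_t frame = { f[0], f[1], f[2], f[3], f[4], f[5], f[6], f[7] };
    return frame;
}

/* Words copied from the dump starting at pxTopOfStack = 0x080329e4. */
const uint32_t sample_stack[33] = {
    0x00000001u, 0x08032ac8u, 0x00000000u, 0x0802d188u,
    0xa5a5a5a5u, 0xa5a5a5a5u, 0xa5a5a5a5u, 0xa5a5a5a5u,
    0xffffffedu, 0x3e800000u, 0x469c4000u, 0xffdedbfdu,
    0xfeff77f3u, 0xbbfff7bfu, 0x37ff95f7u, 0xfec75fffu,
    0xfacfffffu, 0xfe5fffffu, 0xffffdf76u, 0xffff97feu,
    0xffde5ff3u, 0xbedfdff7u, 0xfefdf5f7u, 0xfed7f5ffu,
    0xbeffdbffu, 0x00000000u, 0x08032ffcu, 0x10000000u,
    0xe000e000u, 0xbeb73e00u, 0x100917d5u, 0x1009189au,
    0x61000000u
};
```

Calling decode_saved_frame(sample_stack) skips 25 words and yields the same LR (0x100917d5) and PC (0x1009189a) that were found by hand in the earlier post.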

That answers all my questions for now. In summary, what I plan to do is:

Have a watchdog counter for each task
Each task increments its counter at strategic places within the code
All of the blocking calls in the monitored tasks have timeouts so they can increment the counters even if they have nothing else to do.
A Software Timer will periodically check the counters to make sure that no task is hung up.
configTIMER_TASK_PRIORITY will remain at (configMAX_PRIORITIES - 2). In this case, that means that there is one real-time task that the watchdog can’t always handle, but the Software Timer callback watchdog can preempt any other task.
If the watchdog detects a hang, he will:
- capture (at least) the Program Counter (PC) and Link Register (LR) of the hung task
- write a fault record to flash memory for later analysis
- do some cleanup
- reset system
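One possible shape for the fault record in the plan above is sketched below. The layout, the magic value, the XOR checksum and the names (fault_record_t, fault_record_fill, fault_record_valid) are all assumptions made for illustration; nothing in this thread specifies them. The flash write itself is hardware specific and is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FAULT_MAGIC    0x57444F47u /* arbitrary marker, ASCII "WDOG" */
#define FAULT_NAME_LEN 16u

typedef struct {
    uint32_t magic;                 /* marks a slot that holds a record */
    char task_name[FAULT_NAME_LEN]; /* NUL-terminated name of the hung task */
    uint32_t pc;                    /* stacked program counter */
    uint32_t lr;                    /* stacked link register */
    uint32_t checksum;              /* XOR of all preceding 32-bit words */
} fault_record_t;

/* XOR every word that precedes the checksum field. */
uint32_t fault_checksum(const fault_record_t *rec)
{
    uint32_t words[offsetof(fault_record_t, checksum) / sizeof(uint32_t)];
    uint32_t sum = 0u;

    memcpy(words, rec, sizeof words);
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++) {
        sum ^= words[i];
    }
    return sum;
}

/* Build a record in RAM, ready to be written to flash. */
void fault_record_fill(fault_record_t *rec, const char *name,
                       uint32_t pc, uint32_t lr)
{
    memset(rec, 0, sizeof *rec);
    rec->magic = FAULT_MAGIC;
    strncpy(rec->task_name, name, FAULT_NAME_LEN - 1u); /* keeps final NUL */
    rec->pc = pc;
    rec->lr = lr;
    rec->checksum = fault_checksum(rec);
}

/* Used after reboot to decide whether a stored record can be trusted. */
bool fault_record_valid(const fault_record_t *rec)
{
    return rec->magic == FAULT_MAGIC && rec->checksum == fault_checksum(rec);
}
```

A checksum of some kind is worth having because the reset may interrupt the flash write, leaving a partial record that should be ignored on the next boot.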

Thanks for your help, everyone!