Software Watchdog Design

I’m looking at implementing a watchdog to detect hangs in any of several tasks. I’m thinking along these lines:

  • Have a watchdog counter for each task
  • Each task increments a counter at strategic places within the code
  • A Software Timer (or Tick Hook?) periodically checks the counters to make sure that no task is hung up

In case a hang (of one or more tasks) is detected, the watchdog

  • Captures a fault record (say, in flash)
  • Resets the system

For the fault record, I’d like to be able to capture, say, the Program Counter (PC) and Link Register (LR) of the hung task. The watchdog knows which task is hung. Can it go into the TasksLists and find the stack frame? [Follow the TaskHandle to pxTopOfStack?] Is there any example code around?

This is a cut and paste of a reply I hurriedly gave to a similar question (in a different channel) recently. Probably something we should write a blog about:

I’m going to give my standard, annoying engineer’s answer and say “it depends on the use case”, but in general the watchdog is likely going to want to check:

1. All important tasks are still executing and getting enough CPU time.
2. The scheduler is still executing.
3. The Idle task is getting some CPU time too, showing that none of the higher priority application tasks are stuck in any loops.
4. Interrupts are still executing.

One way to do that is to have all of the above keep an execution (or cycle) count. This can simply be an incrementing volatile unsigned int. For example, if you have a task that must run at least once every 50ms, then when that task executes, have it programmatically check that no more than 50ms has passed since it last executed (it can do that manually by reading the tick count). If the time is breached, latch an error in a variable; otherwise increment that task’s execution/cycle counter.

The watchdog kicking is then performed from a monitor task that runs periodically (or an interrupt that runs periodically). The monitor task only kicks the watchdog if no errors have been latched (everything is healthy) and all the execution/cycle counters are still incrementing (all tasks are still running - so not stuck in an infinite loop, deadlocked, or starved of execution time). If those conditions aren’t met the watchdog isn’t kicked and times out - resetting the system (or the task can reset manually if it has that capability).
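A minimal sketch of the counter-and-monitor scheme described above, in plain C so it can be read without the RTOS headers. All names here (task_health_t, health_init, task_checkin, monitor_all_healthy, NUM_MONITORED) are made up for illustration and are not FreeRTOS APIs; in a real build the 'now' argument would presumably come from xTaskGetTickCount().

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MONITORED 3u  /* hypothetical number of monitored tasks */

/* One health record per monitored task (illustrative layout). */
typedef struct {
    volatile uint32_t cycles;     /* incremented by the task on each on-time pass */
    volatile bool     error;      /* latched by the task if it ran too late */
    uint32_t          last_tick;  /* tick count at the task's previous pass */
    uint32_t          max_period; /* longest allowed gap between passes, in ticks */
    uint32_t          last_seen;  /* cycles value the monitor saw at its last check */
} task_health_t;

static task_health_t health[NUM_MONITORED];

/* Set the deadline and starting time for one task before it first runs. */
void health_init(uint32_t id, uint32_t now, uint32_t max_period)
{
    health[id].cycles = 0u;
    health[id].error = false;
    health[id].last_tick = now;
    health[id].max_period = max_period;
    health[id].last_seen = 0u;
}

/* Called by a task once per loop iteration. */
void task_checkin(uint32_t id, uint32_t now)
{
    task_health_t *h = &health[id];
    /* Unsigned subtraction stays correct across tick-counter wraparound. */
    if ((uint32_t)(now - h->last_tick) > h->max_period) {
        h->error = true;   /* deadline breached: latch the error, do not count */
    } else {
        h->cycles++;
    }
    h->last_tick = now;
}

/* Called periodically by the monitor. Returns true only when no error is
 * latched and every counter has moved since the previous check, which is
 * the condition for kicking the hardware watchdog. */
bool monitor_all_healthy(void)
{
    bool ok = true;
    for (uint32_t i = 0u; i < NUM_MONITORED; i++) {
        task_health_t *h = &health[i];
        uint32_t c = h->cycles;
        if (h->error || c == h->last_seen) {
            ok = false;
        }
        h->last_seen = c;
    }
    return ok;
}
```

The monitor period has to be longer than the slowest task’s period, otherwise a healthy task can look stalled simply because it has not been due to run yet.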

There are other ways too - for example you can have tasks set a bit in an event group then periodically check all the expected bits are set in the event group before clearing all the bits again - that shows the tasks expected to run did run (because they set their bit) but doesn’t say how often they ran and so doesn’t guard against tasks having invalid timing, or getting stuck after setting their bit. You then also need to decide what to do when the system reboots - some systems will tell you the reason for last reset - if the last reset was a watchdog timeout then you might want to take an action.
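The "set a bit, then check and clear" variant mentioned above can be sketched like this. To keep it self-contained it uses a plain bitmask rather than a real event group; the names (alive_bits, task_report_alive, monitor_check_and_clear, the TASK_x_BIT values) are hypothetical. In a FreeRTOS build the set and clear steps would more likely go through xEventGroupSetBits() and xEventGroupClearBits(), since the read-modify-write on a shared variable shown here is not safe against preemption.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per monitored task (hypothetical assignment). */
#define TASK_A_BIT    (1u << 0)
#define TASK_B_BIT    (1u << 1)
#define TASK_C_BIT    (1u << 2)
#define ALL_TASK_BITS (TASK_A_BIT | TASK_B_BIT | TASK_C_BIT)

/* Stand-in for the event group; NOT preemption-safe as written. */
static volatile uint32_t alive_bits;

/* Called by each task somewhere in its main loop. */
void task_report_alive(uint32_t bit)
{
    alive_bits |= bit;
}

/* Called periodically by the monitor. Returns true if every expected task
 * reported in since the previous call, then clears the bits for the next
 * period. */
bool monitor_check_and_clear(void)
{
    bool all_ran = (alive_bits & ALL_TASK_BITS) == ALL_TASK_BITS;
    alive_bits = 0u;
    return all_ran;
}
```

As noted above, this only shows that each task ran at least once in the period. It says nothing about how often, so it does not replace the per-task timing check.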

Hello,
For implementing a watchdog in a multitasking project, here is some useful information for better understanding:

Hope that helps.

Would a watchdog implemented as a Software Timer callback be able to run if one of the monitored tasks is in a tight loop? I currently have

#define configTIMER_TASK_PRIORITY (configMAX_PRIORITIES - 2)

Also,

#define configUSE_PREEMPTION 1

and

#define configUSE_TIME_SLICING 1

If a watchdog (in another task) has a TaskHandle to a task that he suspects is hung, how can he get a stack frame for that task? pxTopOfStack? Does that point to the current stack frame or the next available stack frame? If the latter, how can he back up to the current stack frame? This project (on Cortex M4F) does use the floating point unit. Are there any code samples around?

The exact meaning of pxTopOfStack is somewhat port dependent, but will tend to be the value of the stack pointer at the point when the task was last interrupted and switched out, after saving all the context. You can look at the port code to see how a task is switched out and state saved to see what the stack structure looks like.

Note, if you are checking from an interrupt, then the TCB for the currently running task will not be up to date; its stack pointer would have to be obtained from the ISR stack frame.

As long as the task in the tight loop doesn’t have a higher priority than the timer task, the timer callback should run. Since you have the timer task at configMAX_PRIORITIES-2, a task at configMAX_PRIORITIES-1 could block it by running.

Also, if another timer callback hangs, then your watchdog callback wouldn’t get a chance to run.

If I am checking from a Software Timer callback I should be running in the context of the RTOS daemon (or ‘timer service’) task, so the TCB for the task I want to examine should be up to date, right?

Makes sense. configMAX_PRIORITIES is really the number of priorities, right? So configMAX_PRIORITIES-1 is as high as it can get? I do have one real-time task set to configMAX_PRIORITIES-1, so I guess I would need to adjust that… which means I will have to review my other Software Timers for any impacts to the real-time task.

The fact that your top priority task could block this watchdog might not be a reason to move up the timer task. One of the reasons that the timer task has a programmable priority is for just such cases, and if the timer task could cause that task to miss its requirements, that could be a very good reason to keep the task above it. I suppose part of the question is how likely that task is to actually run indefinitely, starving the rest of the system. Such ultra-high-priority tasks tend to be fairly simple and easy to verify against such problems.

As to your other question, yes, if your code is running in a task, then the TCBs of all other tasks will be up to date.

As an experiment, I attached a debugger, set a breakpoint in one of my tasks, and then looked at the memory pointed to by the TaskHandle of another task:

0x08032ff8 080329e4 032c22c4 08039744 0802ce04 08032ff8 0802cdfc 00000010 00000000 .)…",.D…/…
0x08033018 00000000 08032ff8 00000000 00000004 080324a8 54626956 006b7361 00000000 …/…$…VibTask…
0x08033038 00000000 00000008 00000000 00000004 00000000 00000000 00000002 08040c80
0x08033058 8302001f 00000000 000b5bdb 00000000 00000000 08033354 080333bc 08033424 …[…T3…3…$4…
0x08033078 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Comparing it to the stack frame diagrams (one from the Cortex-M reference manual, one from elsewhere; images not preserved here), I can’t make heads nor tails of it. It looks to me like:

R0 = 0x08032ff8
R1 = 080329e4
R3 = 032c22c4
R12 = 08039744
LR = 0802ce04
PC = 08032ff8
PSR = 0802cdfc

I was hoping to find something useful in PC and LR that would help me locate a hang. On this system, the code is in flash which is 0x10080000 - 0x10100000. PC = 0x08032ff8 and LR = 0x0802ce04 would be pointing into RAM.

Oh, wait, I need to follow one more level of indirection. The TaskHandle has value 0x08032FF8, but that’s pointing to the TCB, which is what I showed above. So, the first field of the TCB is pxTopOfStack, or 0x080329e4. If I look there:

0x080329e4 00000001 08032ac8 00000000 0802d188 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5
0x08032a04 ffffffed 3e800000 469c4000 ffdedbfd feff77f3 bbfff7bf 37ff95f7 fec75fff
0x08032a24 facfffff fe5fffff ffffdf76 ffff97fe ffde5ff3 bedfdff7 fefdf5f7 fed7f5ff
0x08032a44 beffdbff 00000000 08032ffc 10000000 e000e000 beb73e00 100917d5 1009189a
0x08032a64 61000000 ffddfbff dffdf7e7 fd5ff7ee fadf5577 bf800000 3e178897 3e1cd04f

so

R0 = 00000001
R1 = 08032ac8
R2 = 00000000
R3 = 0802d188
R12 = a5a5a5a5
LR = a5a5a5a5
PC = a5a5a5a5
PSR = a5a5a5a5

Still not helpful.

I think I’m getting warmer: if I add 0x64 to the topOfStack, for (one less than?) the size of an 8-byte aligned Extended Stack Frame, I get to

0x08032a48 00000000 08032ffc 10000000 e000e000 beb73e00 100917d5 1009189a 61000000
0x08032a68 ffddfbff dffdf7e7 fd5ff7ee fadf5577 bf800000 3e178897 3e1cd04f 3c96f501
0x08032a88 3e3a3326 3e638e2a 3e924926 39b2080f 3f317180 3f15f6d9 3f800000 3ed79000
0x08032aa8 20000010 10091885 00000008 00000001 0802d230 100941d1 100df868 00000002

so:

R0 = 00000000
R1 = 08032ffc
R2 = 10000000
R3 = e000e000
R12 = beb73e00
LR = 100917d5
PC = 1009189a
PSR = 61000000

At PC = 1009189a appears:

0x10091882 b.n 10091860 <xTaskDelayUntil+0x7c>
0x10091884 .word 0x0802cdf0
0x10091888 .word 0x0802ce74
0x1009188C .word 0x100b5d1c
0x10091890 .word 0x100b6094
0x10091894 .word 0x100b5bc0
0x10091898 .word 0x100b5d30
0x1009189C .word 0x100b5c78

and xTaskDelayUntil is exactly what I expect that task to be doing!

At LR = 100917d5 appears

0x100917B0 bl	10092eac <vPortExitCritical>
2317: }
0x100917B4 mov	r0, r4
0x100917B6 pop	{r3, r4, r5, pc}
0x100917B8 .word	0x0802cdf0
0x100917BC .word	0x0802cde8
0x100917C0 .word	0x100b5d04
0x100917C4 .word	0x100b6054
0x100917C8 .word	0x100b5bc0
0x100917CC .word	0x0802ce34
0x100917D0 .word	0x0802cdf8
0x100917D4 .word	0x0802cc50
0x100917D8 .word	0x0802cc44
0x100917DC .word	0x0802ce78
0x100917E0 .word	0x0802ce30

I don’t know what to make of that. But this is at the top level of a task, so maybe there’s no real calling function.

Remaining questions:

  • So it looks like pxTopOfStack points to the next available frame rather than the last saved frame?
  • Is the Reserved word at offset 0x64 in an Extended Frame saved?
  • How do I know whether the TopOfStack frame is basic or extended?
  • How do I know whether the TopOfStack frame is 4-byte aligned or 8-byte aligned?

Look in the code for the port you are using (typically a port.c). There will be the code that saves the register set to prepare to switch out the task (for the M4, see the function xPortPendSVHandler). Looking at that code, and the method used to call the scheduler (which for the M4 is a software pended interrupt) should let you trace back what should be on the stack.

OK, I have studied and stepped through xPortPendSVHandler and I think I now understand what I am seeing.

The key instructions are:

tst r14, #0x10 /* Is the task using the FPU context?  If so, push high vfp registers. */
vstmdbeq r0!, {s16-s31} // 16 words
// VSTMDB: Floating-point Store Multiple, Decrement Before
// EQ suffix: executes when Z = 1 (Equal)

and

stmdb	r0!, {r4, r5, r6, r7, r8, r9, sl, fp, lr} // 9 words
// "STore Multiple Decrement Before" pushes all the callee-saved core registers onto psp 

So, I have 16 + 9 = 25 words pushed onto the stack, which is 100 (0x64) bytes. So that accounts for the 0x64 that I need to add to the pxTopOfStack to get to the R0, R1, R2, R3, R12, LR, PC, and PSR that I’m interested in. Note: I think the frame diagrams I showed above don’t apply here.

With that understanding, in general I need to determine whether or not the high vfp registers are there, to know whether I need to add 9 or 25 words (0x24 or 0x64 bytes) to pxTopOfStack. I should be able to do this by looking at the lr saved at word offset 8 (byte offset 0x20) from pxTopOfStack. This should contain an EXC_RETURN value, and bit 4 (the 0x10 mask in tst r14, #0x10) will tell me whether or not there are floating-point registers: if the bit is clear, they are there. In the dump above, that word is 0xffffffed, so bit 4 is clear and the floating-point registers were saved.
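Here is a small sketch of that decoding step, assuming the Cortex-M4F layout worked out in this thread (r4-r11 plus EXC_RETURN pushed by software, optionally preceded by s16-s31, sitting below the hardware-stacked R0, R1, R2, R3, R12, LR, PC, xPSR). The type and function names (saved_frame_t, decode_saved_frame) are invented for illustration; other ports or FreeRTOS versions may lay the stack out differently, so check the port code before relying on the offsets.

```c
#include <stdbool.h>
#include <stdint.h>

#define SW_CORE_WORDS    9u          /* r4-r11 and EXC_RETURN, pushed by stmdb */
#define SW_FPU_WORDS     16u         /* s16-s31, pushed by vstmdbeq */
#define EXC_RETURN_INDEX 8u          /* word index of the saved r14 */
#define EXC_RETURN_NO_FP (1u << 4)   /* 0x10: bit clear means FPU state was stacked */

/* The registers of interest from the hardware-stacked frame. */
typedef struct {
    uint32_t lr;
    uint32_t pc;
    uint32_t psr;
    bool     has_fpu_context;
} saved_frame_t;

/* 'top' is the task's pxTopOfStack value viewed as an array of 32-bit words.
 * Only meaningful for a task that is currently switched out. */
saved_frame_t decode_saved_frame(const uint32_t *top)
{
    saved_frame_t f;
    uint32_t exc_return = top[EXC_RETURN_INDEX];

    f.has_fpu_context = (exc_return & EXC_RETURN_NO_FP) == 0u;

    /* Skip the software-saved registers to reach R0 of the hardware frame. */
    uint32_t skip = SW_CORE_WORDS + (f.has_fpu_context ? SW_FPU_WORDS : 0u);
    const uint32_t *hw = top + skip;  /* R0, R1, R2, R3, R12, LR, PC, xPSR */

    f.lr  = hw[5];
    f.pc  = hw[6];
    f.psr = hw[7];
    return f;
}
```

Fed the words from the dump at 0x080329e4 above, this skips 25 words and picks out LR = 0x100917d5 and PC = 0x1009189a, matching the manual walk-through.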

That answers all my questions for now. In summary, what I plan to do is:

  • Have a watchdog counter for each task

  • Each task increments its counter at strategic places within the code

  • All of the blocking calls in the monitored tasks have timeouts so they can increment the counters even if they have nothing else to do.

  • A Software Timer will periodically check the counters to make sure that no task is hung up.

  • configTIMER_TASK_PRIORITY will remain at (configMAX_PRIORITIES - 2). In this case, that means that there is one real-time task that the watchdog can’t always handle, but the Software Timer callback watchdog can preempt any other task.

  • If the watchdog detects a hang, he will:

    • capture (at least) the Program Counter (PC) and Link Register (LR) of the hung task
    • write a fault record to flash memory for later analysis
    • do some cleanup
    • reset system
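For the fault record itself, one possible shape is sketched below. Nothing here comes from FreeRTOS or from this thread beyond the idea of saving PC and LR: the struct layout, the FAULT_MAGIC value, and the additive checksum are all assumptions chosen for illustration, and the actual flash write is left out because it is hardware specific.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAULT_MAGIC 0x57444F47u  /* arbitrary marker ("WDOG") chosen for this sketch */

/* Hypothetical record layout; every field is one 32-bit word. */
typedef struct {
    uint32_t magic;       /* marks the record as written */
    uint32_t task_index;  /* which monitored task was found hung */
    uint32_t pc;          /* saved program counter of that task */
    uint32_t lr;          /* saved link register of that task */
    uint32_t tick;        /* tick count when the hang was detected */
    uint32_t checksum;    /* simple sum of the fields above */
} fault_record_t;

static uint32_t fault_checksum(const fault_record_t *r)
{
    /* A plain additive checksum; a CRC would be stronger. */
    return r->magic + r->task_index + r->pc + r->lr + r->tick;
}

/* Build a record ready to be written to non-volatile storage. */
fault_record_t fault_record_make(uint32_t task_index, uint32_t pc,
                                 uint32_t lr, uint32_t tick)
{
    fault_record_t r;
    r.magic = FAULT_MAGIC;
    r.task_index = task_index;
    r.pc = pc;
    r.lr = lr;
    r.tick = tick;
    r.checksum = fault_checksum(&r);
    return r;
}

/* Check a record read back after reset; erased flash (all 0xFF) and
 * partially written records both fail this test. */
bool fault_record_valid(const fault_record_t *r)
{
    return r->magic == FAULT_MAGIC && r->checksum == fault_checksum(r);
}
```

The validity check is there because the system is reset right after the write, so the boot code needs a way to tell a complete record from erased or half-written flash.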

Thanks for your help, everyone!