I am developing a plan to recover from potential device hangs, specifically cases where a task becomes stuck in an infinite loop. My system receives an interrupt from another chip at fixed time intervals. My question: if a task is stuck when the interrupt from the external chip arrives, is there a way to suspend the stuck task from the ISR? I would then record the failure (send a packet to another chip on the board) and reset the system. This seems odd, since the program is executing the task and I want to suspend that same task from the ISR. Any feedback is helpful. Thanks.
While an ISR can suspend a task, that isn’t likely to solve the issue: the task will still be stuck, just no longer using any CPU time.
The key feature you need to design into the task is some form of timeout. Unless a long wait is legitimately expected (a command terminal, for example, might wait forever for the next command), most device operations should have a definite timeout and a handler for it, even for errors that you think should be ‘impossible’. If you force yourself to handle the impossible errors, you will naturally handle the very unlikely errors that you didn’t think of in the first place.
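To make the timeout idea concrete, here is a minimal sketch in C. All the names in it (`deadline_t`, `deadline_start`, `deadline_expired`, `wait_until_ready`, and the `tick_fn`/`ready_fn` callbacks) are made up for illustration and are not part of any RTOS API. It assumes you have some free-running 32-bit tick counter available.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: remembers when a wait began and how long it may last. */
typedef struct {
    uint32_t start;    /* tick count when the wait began */
    uint32_t timeout;  /* maximum number of ticks to wait */
} deadline_t;

static inline deadline_t deadline_start(uint32_t now, uint32_t timeout_ticks)
{
    deadline_t d = { now, timeout_ticks };
    return d;
}

/* Unsigned subtraction gives the elapsed ticks even if the counter has wrapped. */
static inline bool deadline_expired(const deadline_t *d, uint32_t now)
{
    return (uint32_t)(now - d->start) >= d->timeout;
}

/* Assumed callbacks: supply your own tick source and device status check. */
typedef uint32_t (*tick_fn)(void);
typedef bool (*ready_fn)(void);

/* Returns true if the device became ready, false if the wait timed out. */
static inline bool wait_until_ready(ready_fn device_ready, tick_fn get_ticks,
                                    uint32_t timeout_ticks)
{
    deadline_t d = deadline_start(get_ticks(), timeout_ticks);
    while (!device_ready()) {
        if (deadline_expired(&d, get_ticks())) {
            return false;  /* caller handles the 'impossible' error here */
        }
    }
    return true;
}
```

The caller then decides what a `false` return means: retry, report the fault, or escalate. Under an RTOS you would normally block on a queue or semaphore with a timeout argument rather than spin like this, but the principle is the same: no wait without an upper bound.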
Your idea of a total system reset IS one possible solution if the error means you have no idea what state the system is in, or what would be needed to recover to a useful state. It should be a last resort, though, and only for failures of core functionality. Depending on what the system might be doing at the time, it may also need a more controlled shutdown and restart.
Some months ago I wrote a blog entry about a software watchdog that lets the system detect and recover from hang-ups. It’s written in Spanish, but Google can translate it into your native language:
(Look for example number 4, too). Hope it helps.
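For readers who just want the general shape of a software watchdog, here is a rough sketch in C. It is not the code from the blog entry; the names (`sw_watchdog_t`, `wd_register`, `wd_kick`, `wd_check`, `WD_MAX_TASKS`) are invented for this example. Each supervised task sets its own flag whenever it makes progress, and a periodic check (for instance, from the external chip's interrupt in the original question) looks for any task that has gone quiet.

```c
#include <stdint.h>

#define WD_MAX_TASKS 8u  /* assumed upper bound on supervised tasks */

/* One byte per task, so each task only ever writes its own flag. */
typedef struct {
    volatile uint8_t alive[WD_MAX_TASKS];  /* 1 = task checked in this period */
    uint8_t supervised[WD_MAX_TASKS];      /* 1 = task is being monitored */
} sw_watchdog_t;

/* Start monitoring the task with the given id (0 .. WD_MAX_TASKS-1). */
static inline void wd_register(sw_watchdog_t *wd, unsigned id)
{
    if (id < WD_MAX_TASKS) {
        wd->supervised[id] = 1u;
    }
}

/* Called by a task from its main loop to signal that it is still running. */
static inline void wd_kick(sw_watchdog_t *wd, unsigned id)
{
    if (id < WD_MAX_TASKS) {
        wd->alive[id] = 1u;
    }
}

/* Called once per period, e.g. from the periodic ISR.
 * Returns the lowest id of a supervised task that did not check in,
 * or -1 if every supervised task did. Clears all flags for the next period. */
static inline int wd_check(sw_watchdog_t *wd)
{
    int stuck = -1;
    for (unsigned i = 0; i < WD_MAX_TASKS; i++) {
        if (wd->supervised[i] && !wd->alive[i] && stuck < 0) {
            stuck = (int)i;
        }
        wd->alive[i] = 0u;
    }
    return stuck;
}
```

If `wd_check` returns a task id, the ISR can record the failure and request the reset. The check period has to be longer than the slowest task's normal loop time, otherwise a healthy task will be reported as stuck.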