FreeRTOS total system failure

Hi All

I’ve been dropped on an issue with a system running FreeRTOS on a Zynq 7000 that seems really difficult to pin down. Long story short, the system dies completely after a sequence of user steps - by that I mean no output on serial, the JTAG debugger stops responding, and all LEDs and the HMI freeze. After a lot of printfs, I’ve narrowed it down to a spot just before the freeze happens - a call to take a semaphore (with infinite blocking time). It’s always the same place, even with a scaled-down version of the firmware (i.e. some tasks not started). I’ve set a watchpoint on the aforementioned semaphore, hoping to see memory corruption, but found no issue there.
Any ideas on what could cause a catastrophic failure like this and how to debug it?
All comments appreciated.

Does it happen in the same place on all hardware units - or just one?

If you are killing the JTAG, then some likely causes are a wild write that is hitting a control register and breaking the JTAG, or something that is interfering with your system clock. “Normal” software operation shouldn’t be able to disable JTAG.

Yes, exactly the same place.
One thing I forgot to mention is that the watchdog on the ARM processor stops working as well - with less severe problems the product at least reboots on its own.

If the watchdog stops too then I’m suspicious of a power problem. Can you answer my question as to whether this happens on just one device or all devices?

@rtel Yes, it happens on all devices.
I’ve now found that the watchdog doesn’t work if it’s configured as a timer (i.e. it triggers an interrupt - the way it’s implemented now). When I change it to be configured as a watchdog, the reboot does happen. I tried setting a watchpoint on the ‘reset status’ register, but it only triggers after the system has already restarted.

If it’s happening after a lot of printfs, could you potentially be overflowing a buffer and perhaps overwriting something to do with the semaphore, resulting in a bad pointer or some such?

Can you perhaps take a copy of the semaphore’s memory before and after the crash - if you’re able to do the latter once it has crashed - to see if something has changed drastically?
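A rough sketch of that before/after comparison, in plain C so it is independent of any particular port. The names `mem_snapshot_t`, `snapshot_take`, `snapshot_first_diff` and the 128-byte limit are made up for illustration and are not part of FreeRTOS.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SNAPSHOT_MAX_BYTES 128u   /* assumed upper bound on the object size */

/* Hypothetical helper type: holds a byte-for-byte copy of an object. */
typedef struct {
    uint8_t bytes[SNAPSHOT_MAX_BYTES];
    size_t  length;
} mem_snapshot_t;

/* Copy `length` bytes starting at `object` into the snapshot.
 * Returns 0 on success, -1 on a NULL argument or an oversized object. */
int snapshot_take(mem_snapshot_t *snap, const void *object, size_t length)
{
    if (snap == NULL || object == NULL || length > SNAPSHOT_MAX_BYTES) {
        return -1;
    }
    memcpy(snap->bytes, object, length);
    snap->length = length;
    return 0;
}

/* Return the offset of the first byte that differs from the snapshot,
 * or -1 if the memory still matches the snapshot. */
long snapshot_first_diff(const mem_snapshot_t *snap, const void *object)
{
    const uint8_t *now = (const uint8_t *)object;
    for (size_t i = 0; i < snap->length; i++) {
        if (now[i] != snap->bytes[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

If the semaphore was created statically (for example with xSemaphoreCreateBinaryStatic()), its storage is a StaticSemaphore_t, so sizeof(StaticSemaphore_t) would be a reasonable length to pass. Taking the snapshot right after creation and calling snapshot_first_diff() just before the failing take would show whether the structure has been overwritten by that point. Fields that change in normal use, such as the count, will also show up as differences, so the offset needs interpreting.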

But perhaps this could be ruled out simply by looking at where the semaphore pointer/memory is allocated in relation to the buffer used by printf. If the former is directly after the latter, perhaps there is some credibility to this.

Another option is monitoring stack usage in your tasks. Some printf implementations, I am led to believe, can be quite memory hungry, and maybe you’re blowing out the stack of a task and overwriting something critical.
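For the stack monitoring idea, FreeRTOS already provides uxTaskGetStackHighWaterMark() (enabled with INCLUDE_uxTaskGetStackHighWaterMark) and the configCHECK_FOR_STACK_OVERFLOW hook. The technique behind the high water mark is stack painting, which can be sketched in plain C as below. The function names are hypothetical, and the sketch assumes a stack that grows downward, as on ARM.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5u   /* known pattern written before the task runs */

/* Fill the whole stack area with the known pattern. */
void stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* Count how many bytes at the low-address end still hold the pattern.
 * With a downward-growing stack the low end is used last, so this is
 * the smallest amount of free stack seen so far (the high water mark). */
size_t stack_unused_bytes(const uint8_t *stack, size_t size)
{
    size_t unused = 0;
    while (unused < size && stack[unused] == STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

A result at or near zero for any task would suggest that task has run into, and possibly past, the end of its stack. Note that uxTaskGetStackHighWaterMark() reports the equivalent figure in words rather than bytes.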

Thanks, but printf isn’t the issue here, the system freezes with/without it.

I’ve had problems where the semaphore is still a null pointer. This freezes things on an ARM processor, but doesn’t kill the debugging. The debugging, however, is not JTAG in this case.
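A cheap way to rule out the null-handle case is to check the handle before every take. The sketch below uses stand-in types so that it compiles on its own: `sem_handle_t`, `sem_take_fn`, `checked_take` and `fake_take` are all hypothetical, and in real firmware the check would simply sit in front of the xSemaphoreTake() call (or be caught by configASSERT() if that is defined).

```c
#include <stddef.h>

typedef void *sem_handle_t;                              /* stand-in for SemaphoreHandle_t */
typedef int (*sem_take_fn)(sem_handle_t, unsigned long); /* stand-in for the take call */

typedef enum {
    TAKE_OK = 0,
    TAKE_TIMEOUT = 1,
    TAKE_NULL_HANDLE = 2
} take_result_t;

/* Refuse to call the take function with a NULL handle, so a semaphore
 * that was never created is reported instead of being dereferenced. */
take_result_t checked_take(sem_handle_t handle, sem_take_fn take, unsigned long ticks)
{
    if (handle == NULL || take == NULL) {
        return TAKE_NULL_HANDLE;
    }
    return take(handle, ticks) ? TAKE_OK : TAKE_TIMEOUT;
}

/* Demonstration stub only: succeeds unless asked to wait zero ticks. */
int fake_take(sem_handle_t handle, unsigned long ticks)
{
    (void)handle;
    return ticks > 0 ? 1 : 0;
}
```

Logging or trapping on TAKE_NULL_HANDLE at the suspect call site would show quickly whether the handle was ever initialised before the task reached it.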