Hi! We have been troubleshooting a problem for weeks now and could use some clues about how
the behavior we are seeing can occur. We suspect the issue comes from bad usage
of FreeRTOS, not from an actual bug in FreeRTOS.
The system is the Cortex-M7 coprocessor within an iMX8MP SoC. We are using FreeRTOS 11.1.0 and have tried both the ARM_CM4F and the ARM_CM7 ports.
The compiler is the ARM GNU Toolchain 13.2.Rel1.
What we see is that the firmware hangs reliably after 37 hours.
- SysTick and PendSV (numeric prio 15) aren’t called anymore.
- The single task we have also doesn’t run anymore. It’s placed on the delayed list via xQueueReceive(), but never woken again.
- The timer ISR (numeric prio 11) is still getting called at its regular 50 ms interval. It does nothing but try to wake the single task we have. The wake mechanism (event groups, task notifications or simple booleans) doesn’t seem to change anything.
- The Messaging Unit ISR (numeric prio 11) isn’t called anymore. But it’s only called when the Cortex-A53 sends data to the M7. It would supply our task with data via xQueueSendFromISR().
See the sequence diagram below on what we are approximately doing and our various findings.
Whether and which ISR/Task causes the others to stop, we don’t know.
We are aware of the inverse relationship between numeric priority and the logical priority.
We are aware of the interrupt priorities that we can use, and which APIs we can call from them.
These are the limits we have configured:
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY 15
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 2
#define configKERNEL_INTERRUPT_PRIORITY (configLIBRARY_LOWEST_INTERRUPT_PRIORITY << 4)
#define configMAX_SYSCALL_INTERRUPT_PRIORITY (configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << 4)
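For reference, the arithmetic behind these limits can be sketched as below. This is only a host-side illustration, not part of our firmware: it assumes 4 implemented priority bits (which is what the `<< 4` shift above implies), and the names `PRIO_BITS`, `prio_to_register()` and `isr_may_use_fromisr_api()` are hypothetical helpers, not FreeRTOS APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 implemented priority bits, matching the "<< 4" shift above. */
#define PRIO_BITS                               4u
#define LIBRARY_LOWEST_INTERRUPT_PRIORITY       15u
#define LIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY  2u

/* Convert a "library" priority (0..15) into the value stored in the
 * upper bits of the 8-bit hardware priority field. */
static inline uint8_t prio_to_register(uint8_t library_prio)
{
    return (uint8_t)(library_prio << (8u - PRIO_BITS));
}

/* An ISR may call ...FromISR() functions only if its numeric priority is
 * at or above the max-syscall value (i.e. logically at or below it). */
static inline bool isr_may_use_fromisr_api(uint8_t library_prio)
{
    return library_prio >= LIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY &&
           library_prio <= LIBRARY_LOWEST_INTERRUPT_PRIORITY;
}
```

With these values the kernel priority register value is 0xF0, the syscall ceiling is 0x20, and both of our ISRs at numeric prio 11 fall inside the range that is allowed to call the FromISR APIs.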
Various clues and other things of importance.
The issue happens with the -Os optimization flag. Levels -O2 and below seem fine. No long-term tests have been performed yet, however.
We cannot attach a debugger once the system has stopped. We can attach a debugger beforehand and let the system run into the halt.
Once there, we are sometimes able to get a call stack, but the debugger is unable to perform further actions and fails. Most of the information
was extracted via the still-running timer ISR, by writing into known memory areas that are accessible from the Cortex-A53.
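To illustrate the kind of snapshot the timer ISR writes, here is a minimal sketch. The struct layout, the names `diag_snapshot_t` and `diag_publish()`, and the sequence-counter scheme are hypothetical and simplified; on the real target the values would be read inside the ISR and memory barriers would be needed so that the A53 side sees a consistent view.

```c
#include <stdint.h>

/* Hypothetical layout of a diagnostic record placed in memory that the
 * Cortex-A53 can read. The sequence counter is odd while a write is in
 * progress and even once the record is complete. */
typedef struct {
    volatile uint32_t sequence;
    volatile uint32_t tick_count;
    volatile uint32_t basepri;
    volatile uint32_t scheduler_suspended;
} diag_snapshot_t;

/* Publish one snapshot. The values are passed in as parameters so the
 * sketch stays independent of the target; on the M7 they would come from
 * the kernel state and the BASEPRI register. */
static inline void diag_publish(diag_snapshot_t *area,
                                uint32_t tick_count,
                                uint32_t basepri,
                                uint32_t scheduler_suspended)
{
    area->sequence++;                 /* odd: write in progress */
    /* On target: a data memory barrier would go here. */
    area->tick_count = tick_count;
    area->basepri = basepri;
    area->scheduler_suspended = scheduler_suspended;
    /* On target: a data memory barrier would go here. */
    area->sequence++;                 /* even: record complete */
}
```

A reader on the A53 side would accept a record only if it sees the same even sequence value before and after copying it.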
Weirdly, the tick count must be around 2^27 = 0x0800 0000 (± 100 dec) to trigger the problem. We can set configINITIAL_TICK_COUNT so that we don’t have to wait 37 h at a 1 ms tick period. There is no direct usage of the tick count in our code or the SDK portions from what we have seen.
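The relation between the 37 hours and the 2^27 tick count, and how a start value for configINITIAL_TICK_COUNT can be derived, can be sketched as follows. This assumes a 32-bit tick type and a 1 kHz tick; `ticks_to_whole_hours()` and `initial_tick_count_for()` are hypothetical helper names used only for illustration.

```c
#include <stdint.h>

#define TICK_RATE_HZ        1000u               /* 1 ms tick period */
#define SUSPECT_TICK_COUNT  (UINT32_C(1) << 27) /* 0x08000000 */

/* Whole hours elapsed after the given number of ticks. */
static inline uint32_t ticks_to_whole_hours(uint32_t ticks)
{
    return ticks / TICK_RATE_HZ / 3600u;
}

/* Start value for the tick counter so that the suspect count is reached
 * lead_ms milliseconds after boot. */
static inline uint32_t initial_tick_count_for(uint32_t lead_ms)
{
    uint32_t lead_ticks = lead_ms * TICK_RATE_HZ / 1000u;
    return SUSPECT_TICK_COUNT - lead_ticks;
}
```

For example, a lead time of 10 s gives a start value of 0x07FFD8F0, which could then be used as the value of configINITIAL_TICK_COUNT in FreeRTOSConfig.h.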
At the time the system halts:
- BASEPRI is 0
- uxSchedulerSuspended is 0
The configASSERT() macro is defined and would inform us. We have also installed various exception handlers (hardfault, usage fault) should the MCU actually crash.
Our current “minimal reproducible example” is still too large to be posted.
It is stripped of most business logic (ProcessPrevious() from the sequence diagram removed), but still contains the non-trivial and large rpmsg-lite library (_RpmsgLogic() in the sequence diagram) to facilitate communication with the A53 core(s).
We are hoping someone can spot a mistake outright, or offer ideas about what constellation can lead to this behavior, along with any other debugging advice for finding the root cause.
Thanks in advance for any help you can give!