FreeRTOS hangs - iMX8 CM7 RPMSG

jbaum · August 27, 2025, 2:33pm

Hi! We have been troubleshooting a problem for weeks now and could use some clues as to how
the behavior we are seeing could come about. We suspect the issue comes from bad usage
of FreeRTOS, not from an actual bug in FreeRTOS.

The system is the Cortex-M7 coprocessor within an iMX8MP SoC. We are using FreeRTOS 11.1.0 and have tried both the ARM_CM4F and the ARM_CM7 ports.
Compiler is the ARM GNU Toolchain 13.2.Rel1.

What we see is that the firmware hangs reliably after 37 hours.

SysTick and PendSV (numeric prio 15) aren’t called anymore.
The single task we have also doesn’t run anymore.
It’s scheduled on delayed list via xQueueReceive(), but never woken again.
Timer ISR (numeric prio 11) is still getting called in its regular 50ms interval.
It does nothing but try to wake the single task we have. The choice of wake mechanism (event groups,
task notifications, or simple booleans) doesn’t seem to change anything.
Messaging Unit ISR (numeric prio 11) isn’t called anymore.
But it’s only called when Cortex-A53 sends data to M7.
It would supply our task with data via xQueueSendFromISR().

See the sequence diagram below on what we are approximately doing and our various findings.

Whether and which ISR/Task causes the others to stop, we don’t know.

We are aware of the inverse relationship between numeric priority and the logical priority.
We are aware of the interrupt priorities that we can use, and which APIs we can call from them.
These are the limits we have configured:

#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY 15
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 2
#define configKERNEL_INTERRUPT_PRIORITY (configLIBRARY_LOWEST_INTERRUPT_PRIORITY << 4)
#define configMAX_SYSCALL_INTERRUPT_PRIORITY (configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << 4)
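
To make the arithmetic behind these limits explicit, here is a minimal sketch. It assumes 4 implemented priority bits (which the `<< 4` shifts above suggest); `PRIO_BITS` and the helpers `to_hw_priority()` and `isr_may_use_fromisr_api()` are hypothetical names for illustration, not part of FreeRTOS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the FreeRTOSConfig.h excerpt above. */
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY      15
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 2

/* Assumption: 4 implemented priority bits, held in the top nibble
 * of the 8-bit priority field (hence the "<< 4" in the config). */
#define PRIO_BITS 4u

/* Hypothetical helper: library priority (0..15) to the raw 8-bit
 * value that BASEPRI and the NVIC priority registers use. */
static uint8_t to_hw_priority(uint32_t library_prio)
{
    assert(library_prio < (1u << PRIO_BITS));
    return (uint8_t)(library_prio << (8u - PRIO_BITS));
}

/* Hypothetical helper: an ISR may call ...FromISR() APIs only if its
 * numeric priority is at or above the max-syscall value, i.e. its
 * logical priority is at or below it. */
static bool isr_may_use_fromisr_api(uint32_t library_prio)
{
    return library_prio >= configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY
        && library_prio <= configLIBRARY_LOWEST_INTERRUPT_PRIORITY;
}
```

Under that assumption the kernel priority comes out as 0xF0 and the max-syscall priority as 0x20, and an ISR at numeric priority 11 (0xB0) is inside the range from which FromISR calls are allowed.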

Various clues and other things of importance.

The issue happens with the -Os optimization flag. Levels -O2 and below seem fine, although no long-term tests have been performed yet.

We cannot attach a debugger once the system has stopped. We can attach a debugger beforehand and let the system run into the halt.
Once there, we are sometimes able to get a callstack, but the debugger is unable to perform further actions and fails. Most of the information
was extracted via the still-running timer ISR, which writes into known memory areas that the Cortex-A53 can access.

Weirdly, the tick count must be around 2^27 = 0x0800 0000 (± 100 decimal) to trigger
the problem. We can set configINITIAL_TICK_COUNT so that we don’t have to wait 37h at a 1ms tick rate. There is no direct usage of the tick count in our code or in the SDK portions, from what we have seen.
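
As a quick check of the numbers, here is a minimal sketch of the tick arithmetic, assuming the 1ms tick stated above. `TICK_RATE_HZ`, `ticks_to_whole_hours()` and `initial_tick_for()` are hypothetical names, and the 10-second lead time in the note below is only an example value, not the one we actually use.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_RATE_HZ 1000u /* 1 ms tick, as stated above */

/* Hypothetical helper: whole hours of uptime needed to reach a tick count. */
static uint32_t ticks_to_whole_hours(uint32_t ticks)
{
    return ticks / TICK_RATE_HZ / 3600u;
}

/* Hypothetical helper: initial tick count that makes the kernel reach
 * `target` ticks after `lead_seconds` of run time. Only meant for
 * small lead times; the multiplication is not overflow-checked. */
static uint32_t initial_tick_for(uint32_t target, uint32_t lead_seconds)
{
    uint32_t lead_ticks = lead_seconds * TICK_RATE_HZ;
    assert(lead_ticks <= target);
    return target - lead_ticks;
}
```

2^27 ticks at 1 kHz is 134217.728 seconds, a little over 37 hours, which matches the observed time to failure. A value such as `initial_tick_for(0x08000000u, 10u)` (134207728) could then be used for configINITIAL_TICK_COUNT to land on the critical count about ten seconds after the scheduler starts.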

At the time the system halts:

BASEPRI is 0
uxSchedulerSuspended is 0

The configASSERT() macro is defined and would inform us. We have also installed various exception handlers (hardfault, usage fault) should the MCU actually crash.

Our current “minimal reproducible example” is still too large to be posted.
It is stripped of most business logic (ProcessPrevious() from the sequence diagram removed), but still contains the non-trivial and large rpmsg-lite library (_RpmsgLogic() in the sequence diagram) to facilitate communication with the A53 core(s).

We’re hoping someone can spot a mistake outright, or has ideas about what constellation could lead to this behavior, or any other debugging advice that would help find the root cause.

Thanks in advance for any help you can give!

aggarg · August 28, 2025, 4:48am

The above things mean that interrupts are not masked. Is there any place in your code where you stop SysTick? Are you using tickless idle?

Does setting configINITIAL_TICK_COUNT or increasing the tick rate help reproduce the problem faster?

Can you share that callstack?

Please share the definition of configASSERT().

RAc · August 28, 2025, 8:05am

Hi there jbaum,

first of all, thanks for the well prepared and concise but comprehensive problem report. Would that just 10% of the users would do their preinvestigations half as thoroughly as you.

As @aggarg already asked - do I understand the above statement correctly that the “magic number” 2^27 is a sufficient condition to repro the problem? If so, can you set a hardware breakpoint to that value on the sys tick variable and inspect the system state at that point (in particular the hardware registers)? Can you run your target under control of tracealyzer and take a snapshot at that time?

jbaum · August 28, 2025, 8:27am

Thanks for the response!

The above things mean that interrupts are not masked. Is there any place in your code where you stop SysTick? Are you using tickless idle?

We do not use tickless idle and are not aware of anyone stopping the SysTick. In fact, SysTick is enabled, but it’s pending? See the registers below:

SYST_CSR = 0x00010007
  COUNTFLAG = 1
  CLKSOURCE = 1
  TICKINT = 1
  ENABLE = 1

ICSR = 0x1440E047
  PENDSVSET = 1
  PENDSTSET = 1
  ISRPENDING = 1
  VECTPENDING = 0b1110 = 14 = PendSV
  VECTACTIVE = 71 = GPT1 (which is used to dump this info)
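
For anyone who wants to double-check the decoding, here is a minimal sketch that extracts these fields from the raw values, using the ARMv7-M bit positions for ICSR and SYST_CSR. The type and function names (`icsr_fields_t`, `decode_icsr()`, `external_irq_number()`, `systick_enabled_with_interrupt()`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the ICSR fields listed above. */
typedef struct {
    bool     pendsvset;   /* bit 28: PendSV is pending            */
    bool     pendstset;   /* bit 26: SysTick is pending           */
    bool     isrpending;  /* bit 22: an external IRQ is pending   */
    uint32_t vectpending; /* bits 20:12: highest pending exception */
    uint32_t vectactive;  /* bits 8:0: currently active exception  */
} icsr_fields_t;

static icsr_fields_t decode_icsr(uint32_t icsr)
{
    icsr_fields_t f;
    f.pendsvset   = ((icsr >> 28) & 1u) != 0u;
    f.pendstset   = ((icsr >> 26) & 1u) != 0u;
    f.isrpending  = ((icsr >> 22) & 1u) != 0u;
    f.vectpending = (icsr >> 12) & 0x1FFu;
    f.vectactive  = icsr & 0x1FFu;
    return f;
}

/* Exception numbers 16 and up are external interrupts (IRQ0 = 16). */
static uint32_t external_irq_number(uint32_t exception_number)
{
    assert(exception_number >= 16u);
    return exception_number - 16u;
}

/* SYST_CSR: ENABLE is bit 0, TICKINT is bit 1. */
static bool systick_enabled_with_interrupt(uint32_t syst_csr)
{
    return (syst_csr & 0x3u) == 0x3u;
}
```

Feeding in 0x1440E047 gives VECTPENDING = 14 and VECTACTIVE = 71 (external IRQ 55), with both PendSV and SysTick pending, and 0x00010007 shows SysTick enabled with its interrupt on, in line with the values listed above.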

Does setting configINITIAL_TICK_COUNT or increasing the tick rate help reproduce the problem faster?

Yes, both do, which is why we set the initial value so as not to have to wait that long.

Can you share that callstack?

It’s from our single task, the last route that it takes. We don’t have any from when the ISRs are called. It looks fine to us. The task is set to sleep because there’s nothing in the queue.

our_code_receive() (Our code)
  rpmsg_queue_recv (SDK)
      env_get_queue (SDK)
        xQueueReceive (FreeRTOS)
          taskYIELD_WITHIN_API(); (after xTaskCheckForTimeOut() == pdFALSE and                      
                                   prvIsQueueEmpty() != pdFALSE)
            portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;

Please share the definition of configASSERT()

#define configASSERT( x )         \
    if( ( x ) == 0 )              \
    {                             \
        *((uint32_t*)0x70FFF630) = 0xDEADBEEF; \
        DbgConsole_Printf("configAssert failed, file: %s [%d]\r\n", __FILE__, __LINE__); \
        taskDISABLE_INTERRUPTS(); \
        for( ; ; )                \
        ;                         \
    }

The magic address we write to here is in our DDR RAM and can be observed post-mortem by the A53 cores, in case DbgConsole_Printf() does not work in whatever state the system is in. It is just for debugging, of course, and hasn’t triggered yet.

jbaum · August 28, 2025, 8:52am

first of all, thanks for the well prepared and concise but comprehensive problem report. Would that just 10% of the users would do their preinvestigations half as thoroughly as you.

Thanks for the feedback! I really try to get the most out of the (understandably) limited attention and time I can receive on this matter.

As @aggarg already asked - do I understand the above statement correctly that the “magic number” 2^27 is a sufficient condition to repro the problem? If so, can you set a hardware breakpoint to that value on the sys tick variable and inspect the system state at that point (in particular the hardware registers)? Can you run your target under control of tracealyzer and take a snapshot at that time?

It seems to be one of the conditions. Without calling the rpmsg-lite library (from the MCUXpresso SDK) we can’t seem to reproduce the issue. Hence our assumption that it’s a usage or config issue, not a bug in FreeRTOS.

We are simply running out of working theories as to what could cause this issue. Bad ISR priorities, non-ISR API calls from an ISR, a broken critical section, 0x7FFFFFF/0x8000000 masks... we looked for them all without success.

We did run tracealyzer, with a local buffer (we can’t stream), but I must admit that we did not dive too deep into what the tool and trace offer. Note that the system halts only around that tick count, so there is a 200ms window we’d have to capture. If nothing comes up in this discussion, we’ll probably have to try harder with tracealyzer.

In my response to Gaurav, you can see the state of the ICSR and SYST_CSR registers in the meantime.

RAc · August 28, 2025, 9:01am

So do you have a tracealyzer memory dump of which you know that it covers the point where the problem occurred? If so, could you share it or send it to one of us via PM?

jbaum · August 28, 2025, 11:53am

I made new traces just now and verified that the relevant portion is included.
Unfortunately I can’t attach them because I am a new user.

Maybe my account can be unlocked @aggarg ?

I do not see a PM feature here, but I could also send the traces via e-mail if you’d be willing to share an address.

Edit: I uploaded it here for now > https:// basedbin.fly.dev/uL2zMv.zip <
I can also properly upload it in the forum once my account is unlocked for that.
Edit: Uploaded the files here too, now that my account was feature unlocked.

MemDumps.zip (369.9 KB)

Thanks in advance!

RAc · August 28, 2025, 1:12pm

you can click on any user’s avatar, there should be a green “message” button visible.

But making the files public makes more sense because then more pairs of eyes have a chance to look over them.

RAc · August 28, 2025, 1:32pm

ok, here is something that MIGHT be a pointer… there is a call to vTaskDelayUntil that appears to attempt to wait to a point in the past. Just a shot in the dark though.

This is one of the last entries before the hang.

EDIT: Looking closer at what happens, it appears that the call to vTaskDelay() is made inside the timer task. If you do not have timer callbacks, then this must be inside the timer processing logic; if you do, I am certain you know that it is unwise to put any delay/suspend code in the timer callback as you may starve all pending timers.

RAc · August 28, 2025, 2:19pm

ok, that was probably barking up the wrong tree - all time stamps attributed to vTaskDelayUntil() appear to be annotated backwards, so that may not be your problem after all. But I observed that there are several instances over time where nothing, not even ISRs, takes up CPU time, only the idle task. Whenever that scenario is resolved, everything appears to be triggered by whatever the MU_ISR does. So it looks as if that ISR stops firing. I would focus my analysis on why that is.

jbaum · August 28, 2025, 2:32pm

There are certainly periods where nothing is to be done.

MU_ISR supplies new data to the program.
TIMER_ISR periodically wakes OUTPUT_TASK.
OUTPUT_TASK outputs data on this fixed interval, and afterwards updates its state by whatever MU_ISR supplied. Then it goes to sleep, waiting for the next interval. This is when the IDLE task likely comes into play.

But even if there is no new data, TIMER_ISR (which is still running) should be able to wake OUTPUT_TASK eventually and have it output (old) data. So the absence of MU_ISR should not halt the system. My bet would be that broken logic in MU_ISR kills the system, because that’s where most of the code is (inter-core communication via the “Messaging Unit (MU)” peripheral and shared memory). We just can’t find it.

RAc · August 28, 2025, 2:55pm

ok, I am afraid I can’t help much further, but one last observation: In your …7fff817 log, the system itself is not completely hung, it sort of restarts after several idle seconds, but the pattern after “wakeup” is very different from the other idle instances. As you write, the idle task triggers something, but again, the MU_ISR appears to be stalled. Can you determine if the ISR itself is enabled when the hang happens - on both NVIC and device levels?
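
For the NVIC half of that check, here is a minimal sketch of how the still-running timer ISR could test a per-IRQ bit in the NVIC register banks. The helper name `nvic_bit_is_set()` is hypothetical. It takes the register bank as a parameter, so the same function works for the enable (ISER), pending (ISPR) and active (IABR) banks, each of which holds one bit per external IRQ, 32 IRQs per word. The device-level enable bits inside the MU are not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NVIC_BANK_WORDS 8u /* 8 x 32 bits per bank in the Cortex-M7 NVIC */

/* Hypothetical helper: true if the bit for external IRQ `irqn` is set
 * in the given NVIC bank (ISER = enabled, ISPR = pending, IABR = active). */
static bool nvic_bit_is_set(const volatile uint32_t *bank, uint32_t irqn)
{
    assert(irqn < NVIC_BANK_WORDS * 32u);
    return ((bank[irqn >> 5] >> (irqn & 31u)) & 1u) != 0u;
}
```

On the target, the bank pointer would be something like `NVIC->ISER` from the CMSIS core header, and the IRQ number would be the MU interrupt from the device header; both are assumptions here. CMSIS also provides `NVIC_GetEnableIRQ()`, `NVIC_GetPendingIRQ()` and `NVIC_GetActive()` for the same purpose. Scanning the active bank this way would also show whether some other ISR is stuck active.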

jbaum · August 28, 2025, 3:02pm

I’ll go through all your observations tomorrow and see if there is something to them. Thanks for your help! Maybe @aggarg also has some insight, then I’d have enough leads to investigate for the next few days.

aggarg · August 29, 2025, 6:27am

You should now be able to upload.

If the SysTick interrupt stops, from FreeRTOS’s perspective, time has stopped, which can manifest as a system halt. However, the fact that the TIMER ISR continues to fire while both SysTick and PendSV interrupts are pending raises an interesting question. Is it possible that you’re stuck in an ISR that has higher priority than SysTick and lower priority than the TIMER ISR? Would you try setting the TIMER ISR priority to match the SysTick priority level and see if the TIMER ISR continues to fire?

I’d suggest commenting out some portions of the code to narrow down the problematic part.

RAc · August 29, 2025, 7:27am

The tracealyzer file does not indicate that unless there are ISRs that happen to not be monitored by tracealyzer…

aggarg · August 31, 2025, 5:43am

Thank you for pointing that out! I did not examine the Tracealyzer file.

escherstair · September 9, 2025, 6:29am

Hi,

this issue really seems to be the same as the one I found and described here.

I haven’t been able to provide traces, but @jbaum did it in this topic.

As far as I can see this is not an easy task to investigate (and to solve).

Did you get any idea?

Thanks

RAc · September 9, 2025, 8:57am

what makes you think that the issues are related? They look very different to me.

escherstair · September 9, 2025, 9:02am

The M7 side hangs/crashes (I don’t see exactly what happens because when I connect with Segger JLink the core is reset) after a fixed amount of time (more than 32 hours and less than 48) when:

rpmsg is used
FreeRTOS is used
an external periodic interrupt is enabled

I don’t know if the FreeRTOS counter is the issue (I’ll check), but all the symptoms are the same.

No change on the application side alters the time at which the issue occurs.

Disabling the external interrupt is the only way to have M7 running for at least 48 hours.

I cannot remove rpmsg, nor replace FreeRTOS with another OS to double check this

RAc · September 9, 2025, 9:09am

I do not think they are necessarily related. Yours manifests in a crash, jbaum’s in an apparent failure within the MU to fire interrupts. But any clue is worth looking into in a case like that…
