FreeRTOS hangs - iMX8 CM7 RPMSG

Hi! We have been troubleshooting a problem for weeks now and could use some clues as to how
the behavior we are seeing can come about. We suspect the issue comes from bad usage
of FreeRTOS, not from an actual bug in FreeRTOS.

System is the Cortex M7 coprocessor within an iMX8MP SoC. Using FreeRTOS 11.1.0, having tried the ARM_CM4F and the ARM_CM7 port.
Compiler is the ARM GNU Toolchain 13.2.Rel1.

What we see is that the firmware hangs reliably after 37 hours.

  • SysTick and PendSV (numeric prio 15) aren’t called anymore.
  • The single task we have also doesn’t run anymore.
    It’s placed on the delayed list via xQueueReceive(), but never woken again.
  • Timer ISR (numeric prio 11) is still getting called in its regular 50ms interval.
    It does nothing but try to wake the single task we have. The choice of wake mechanism (event groups,
    task notifications or simple booleans) doesn’t seem to change anything.
  • Messaging Unit ISR (numeric prio 11) isn’t called anymore.
    But it’s only called when Cortex-A53 sends data to M7.
    It would supply our task with data via xQueueSendFromISR().

See the sequence diagram below on what we are approximately doing and our various findings.

Whether and which ISR/Task causes the others to stop, we don’t know.

We are aware of the inverse relationship between numeric priority and the logical priority.
We are aware of the interrupt priorities that we can use, and what APIs we can call from them.
These are the limits we have configured:

#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY 15
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 2
#define configKERNEL_INTERRUPT_PRIORITY (configLIBRARY_LOWEST_INTERRUPT_PRIORITY << 4)
#define configMAX_SYSCALL_INTERRUPT_PRIORITY (configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << 4)
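
For readers checking the numbers, here is a minimal host-side sketch of what these settings work out to. It assumes 4 implemented priority bits (which is what the `<< 4` shifts above imply); the helper names `raw_priority` and `may_call_from_isr_api` are made up for illustration and are not FreeRTOS APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 implemented priority bits, matching the << 4 shifts above. */
#define PRIO_BITS 4u

/* Raw 8-bit value as it would be stored in an NVIC priority register. */
static inline uint8_t raw_priority(uint8_t library_prio)
{
    return (uint8_t)(library_prio << (8u - PRIO_BITS));
}

/* True if an ISR at this numeric priority may call ...FromISR() APIs:
 * it must not be logically higher (numerically lower) than
 * configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY. */
static inline bool may_call_from_isr_api(uint8_t isr_prio, uint8_t max_syscall_prio)
{
    return isr_prio >= max_syscall_prio;
}
```

With the values above, the kernel priority becomes 0xF0 and the syscall ceiling 0x20, and both ISRs at numeric priority 11 fall inside the range that is allowed to use the FromISR API.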

Various clues and other things of importance.

The issue happens with the -Os optimization flag. Levels -O2 and below seem fine, but no long-term tests have been performed yet.

We can not attach a debugger when the system has stopped. We can attach a debugger before and let the system run into the halt.
Once there, we sometimes are able to get a callstack. But the debugger is unable to perform more actions and fails. Most of the information
was extracted via the still running timer ISR and writing into known and Cortex-A53 accessible memory areas.

Weirdly, the tickcount must be around 2^27 = 0x0800 0000 (± 100 dec) to trigger
the problem. We can set configINITIAL_TICK_COUNT for not having to wait 37h @ 1ms tick rate. There is no direct usage of the tick count in our code or the SDK portions from what we have seen.
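
The arithmetic behind the 37 hours can be sketched as follows. This assumes the 1 ms tick stated above; `initial_tick_for` is a hypothetical helper showing how a start value a few seconds before the trigger could be derived.

```c
#include <stdint.h>

#define TICK_RATE_HZ 1000u                 /* 1 ms tick, as stated in the post */
#define TRIGGER_TICK (UINT32_C(1) << 27)   /* 2^27 = 0x08000000 */

/* Whole hours of uptime needed to reach a given tick count. */
static inline uint32_t ticks_to_whole_hours(uint32_t ticks)
{
    return ticks / (TICK_RATE_HZ * 3600u);
}

/* Hypothetical helper: initial tick count that places the trigger
 * `lead_ticks` ticks after boot. */
static inline uint32_t initial_tick_for(uint32_t lead_ticks)
{
    return TRIGGER_TICK - lead_ticks;
}
```

For example, `initial_tick_for(5000)` gives 0x07FFEC78, which could be used as configINITIAL_TICK_COUNT to reach the critical window about five seconds after start.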

At the time the system halts:

  • BASEPRI is 0
  • uxSchedulerSuspended is 0

The configASSERT() macro is defined and would inform us. We have also installed various exception handlers (hardfault, usage fault) should the MCU actually crash.

Our current “minimal reproducible example” is still too large to be posted.
It is stripped of most business logic (ProcessPrevious() from the sequence diagram removed), but still contains the non-trivial and large rpmsg-lite library (_RpmsgLogic() in the sequence diagram) to facilitate communication with the A53 core(s).

We are hoping someone will outright spot a mistake, or at least has ideas about what constellation can lead to this behavior, plus any other debug advice for finding the root cause.

Thanks in advance for any help you can give!

The above things mean that interrupts are not masked. Is there any place in your code where you stop SysTick? Are you using tickless idle?

Does setting configINITIAL_TICK_COUNT or increasing the tick rate help in reproducing the problem faster?

Can you share that callstack?

Please share the definition of configASSERT().

Hi there jbaum,

first of all, thanks for the well prepared and concise but comprehensive problem report. Would that just 10% of the users would do their preinvestigations half as thoroughly as you.

As @aggarg already asked - do I understand above statement correctly that the “magic number” 2^27 is a sufficient condition to repro the problem? If so, can you set a hardware breakpoint to that value on the sys tick variable and inspect the system state at that point (in particular the hardware registers)? Can you run your target under control of tracealyzer and take a snapshot at that time?

Thanks for the response!

The above things mean that interrupts are not masked. Is there any place in your code where you stop SysTick? Are you using tickless idle?

We do not use tickless idle and are not aware of anyone stopping the SysTick. In fact, SysTick is enabled, but it’s pending? See the registers below:

SYST_CSR = 0x00010007
COUNTFLAG = 1
CLKSOURCE = 1
TICKINT = 1
ENABLE = 1

ICSR = 0x1440E047
PENDSVSET = 1
PENDSTSET = 1
ISRPENDING = 1
VECTPENDING = 0b1110 = 14 = PendSV
VECTACTIVE = 71 = GPT1 (which is used to dump this info)
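
For anyone who wants to re-check the decode, here is a small sketch that splits an ICSR value into the fields listed above. Bit positions follow the ARMv7-M SCB register layout; the struct and function names are invented for this example.

```c
#include <stdint.h>

/* Selected ICSR fields (ARMv7-M bit positions). */
typedef struct {
    uint32_t vectactive;   /* bits 8:0, exception number being serviced */
    uint32_t rettobase;    /* bit 11 */
    uint32_t vectpending;  /* bits 20:12, highest-priority pending exception */
    uint32_t isrpending;   /* bit 22 */
    uint32_t pendstset;    /* bit 26, SysTick pending */
    uint32_t pendsvset;    /* bit 28, PendSV pending */
} icsr_fields_t;

static inline icsr_fields_t decode_icsr(uint32_t icsr)
{
    icsr_fields_t f;
    f.vectactive  = icsr & 0x1FFu;
    f.rettobase   = (icsr >> 11) & 1u;
    f.vectpending = (icsr >> 12) & 0x1FFu;
    f.isrpending  = (icsr >> 22) & 1u;
    f.pendstset   = (icsr >> 26) & 1u;
    f.pendsvset   = (icsr >> 28) & 1u;
    return f;
}
```

Feeding in 0x1440E047 reproduces the values above (VECTACTIVE 71 is exception number 71, i.e. external IRQ 55). One field not listed above is RETTOBASE, which reads 0 in this dump. According to the Arm documentation, that indicates another exception is active underneath the current handler, so it may be worth checking which one, assuming the dump is accurate.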

Does setting configINITIAL_TICK_COUNT or increasing the tick rate help in reproducing the problem faster?

Yes, both do. That is how we came to set the initial value, so as not to have to wait that long.

Can you share that callstack?

It’s from our single task, the last route that it takes. We don’t have any from when the ISRs are called. It looks fine to us. The task is set to sleep because there’s nothing in the queue.

our_code_receive() (Our code)
  rpmsg_queue_recv (SDK)
      env_get_queue (SDK)
        xQueueReceive (FreeRTOS)
          taskYIELD_WITHIN_API(); (after xTaskCheckForTimeOut() == pdFALSE and                      
                                   prvIsQueueEmpty() != pdFALSE)
            portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;

Please share the definition of configASSERT()

#define configASSERT( x )         \
    if( ( x ) == 0 )              \
    {                             \
        *((uint32_t*)0x70FFF630) = 0xDEADBEEF; \
        DbgConsole_Printf("configAssert failed, file: %s [%d]\r\n", __FILE__, __LINE__); \
        taskDISABLE_INTERRUPTS(); \
        for( ; ; )                \
        ;                         \
    }

The magic address we are writing to here is in our DDR RAM and can be inspected post-mortem by the A53 cores, in case DbgConsole_Printf() does not work in whatever state the system is in. This is just for debugging, of course, and it hasn’t triggered yet.

first of all, thanks for the well prepared and concise but comprehensive problem report. Would that just 10% of the users would do their preinvestigations half as thoroughly as you.

Thanks for the feedback :slight_smile: I really try to get the most out of the (understandably) limited attention and time I can receive on this matter.

As @aggarg already asked - do I understand above statement correctly that the “magic number” 2^27 is a sufficient condition to repro the problem? If so, can you set a hardware breakpoint to that value on the sys tick variable and inspect the system state at that point (in particular the hardware registers)? Can you run your target under control of tracealyzer and take a snapshot at that time?

It seems to be one of the conditions. Without calling the rpmsg-lite library (from the MCUXpresso SDK) we can’t seem to reproduce the issue. Hence our assumption that it’s a usage or config issue, not a bug in FreeRTOS.

We are just running out of working theories as to what could cause this issue. Bad ISR priorities, non-ISR API calls from an ISR, broken critical sections, 0x7FFFFFF/0x8000000 masks... we looked for them all without success.

We did run tracealyzer with a local buffer (we can’t stream), but I must admit that we did not dive too deep into what the tool and trace offer. Note that the system halts only around that tick count, so there is a 200ms window we’d have to capture. If nothing comes up in this discussion, we’ll probably have to try harder with tracealyzer.

In my response to Gaurav, you can see the state of the ICSR and SYST_CSR registers in the meantime.

So do you have a tracealyzer memory dump of which you know that it covers the point where the problem occurred? If so, could you share it or send it to one of us via PM?

I made new traces just now and verified that the relevant portion is included.
Unfortunately I can’t attach them because I am a new user.

Maybe my account can be unlocked @aggarg ?

I do not see a PM feature here, but I could also send the traces via e-mail if you’d be willing to share an address.

Edit: I uploaded it here for now > https:// basedbin.fly.dev/uL2zMv.zip <
I can also properly upload it in the forum once my account is unlocked for that.
Edit: Uploaded the files here too, now that my account was feature unlocked.

MemDumps.zip (369.9 KB)

Thanks in advance!

you can click on any user’s avatar, there should be a green “message” button visible.

But making the files public makes more sense because then more pairs of eyes have a chance to look over them.

ok, here is something that MIGHT be a pointer… there is a call to vTaskDelayUntil that appears to attempt to wait to a point in the past. Just a shot in the dark though.

This is one of the last entries before the hang.

EDIT: Looking closer at what happens, it appears that the call to vTaskDelay() is made inside the timer task. If you do not have timer callbacks, then this must be inside the timer processing logic; if you do, I am certain you know that it is unwise to put any delay/suspend code in the timer callback as you may starve all pending timers.

ok, that was probably barking up the wrong tree - all time stamps attributed to vTaskDelayUntil() appear to be annotated backwards, so that may not be your problem after all. But I observed that there are several instances over time where nothing, not even ISRs, takes up CPU time, only the idle task. Whenever that scenario is resolved, everything appears to be triggered by whatever the MU_ISR does. So it looks as if that ISR stops firing. I would focus my analysis on why that is.

There are certainly periods where nothing is to be done.

  • MU_ISR supplies new data to the program.
  • TIMER_ISR periodically wakes OUTPUT_TASK.
  • OUTPUT_TASK outputs data on this fixed interval, and afterwards updates its state by whatever MU_ISR supplied. Then it goes to sleep, waiting for the next interval. This is when the IDLE task likely comes into play.

But even if there is no new data, TIMER_ISR (which is still running) should eventually be able to wake OUTPUT_TASK and have it output (old) data. So the absence of MU_ISR should not halt the system. My bet would be that broken logic in MU_ISR kills the system, because that’s where most of the code is (inter-core communication via the “Messaging Unit (MU)” peripheral and shared memory). We just can’t find it :frowning:

ok, I am afraid I can’t help much further, but one last observation: In your …7fff817 log, the system itself is not completely hung; it sort of restarts after several idle seconds, but the pattern after “wakeup” is very different from the other idle instances. As you write, the idle task triggers something, but again, the MU_ISR appears to be stalled. Can you determine if the ISR itself is enabled when the hang happens - on both NVIC and device levels?
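
A rough sketch of the NVIC-level part of that check: the enable and pending state of an external interrupt lives in one bit of the NVIC_ISERn / NVIC_ISPRn register words, indexed by IRQ number. The helpers below are illustrative names, not CMSIS functions. On target, the CMSIS calls NVIC_GetEnableIRQ() and NVIC_GetPendingIRQ() give the same information directly; the device-level enable bits in the MU peripheral are a separate check not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Index of the 32-bit NVIC register word that holds this IRQ's bit. */
static inline uint32_t nvic_word_index(uint32_t irqn)
{
    return irqn >> 5;
}

/* Mask selecting this IRQ's bit within that word. */
static inline uint32_t nvic_bit_mask(uint32_t irqn)
{
    return UINT32_C(1) << (irqn & 31u);
}

/* `words` would point at e.g. the ISER or ISPR array; here it is any
 * array of 32-bit words so the logic can also be tried on a host. */
static inline bool irq_bit_is_set(const volatile uint32_t *words, uint32_t irqn)
{
    return (words[nvic_word_index(irqn)] & nvic_bit_mask(irqn)) != 0u;
}
```

The result could be written to the same A53-visible memory area from the still-running timer ISR, next to the ICSR and SYST_CSR dump.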

I’ll go through all your observations tomorrow and see if there is something to them. Thanks for your help :slight_smile: Maybe @aggarg also has some insight, then I’d have enough leads to investigate for the next few days.

You should now be able to upload.

If the SysTick interrupt stops, from FreeRTOS’s perspective, time has stopped, which can manifest as a system halt. However, the fact that the TIMER ISR continues to fire while both SysTick and PendSV interrupts are pending raises an interesting question. Is it possible that you’re stuck in an ISR that has higher priority than SysTick and lower priority than the TIMER ISR? Would you try setting the TIMER ISR priority to match the SysTick priority level and see if the TIMER ISR continues to fire?
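
One way to test the "stuck in another ISR" theory from inside the timer ISR is to list which external interrupts are marked active at that moment. The sketch below assumes the caller passes the NVIC Interrupt Active Bit Register words (NVIC->IABR in CMSIS) and their count; the function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Collects the IRQ numbers whose active bit is set.
 * Returns the total number of active IRQs found, which may exceed
 * out_cap; only the first out_cap entries are stored in `out`. */
static size_t list_active_irqs(const volatile uint32_t *iabr, size_t n_words,
                               uint16_t *out, size_t out_cap)
{
    size_t found = 0;
    for (size_t w = 0; w < n_words; ++w) {
        uint32_t bits = iabr[w];
        for (uint32_t b = 0; b < 32u; ++b) {
            if ((bits & (UINT32_C(1) << b)) != 0u) {
                if (found < out_cap) {
                    out[found] = (uint16_t)(w * 32u + b);
                }
                ++found;
            }
        }
    }
    return found;
}
```

If this reports any IRQ besides the timer's own while the timer ISR is running, that interrupt was preempted mid-handler and would be a candidate for the ISR that never returns.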

I’d suggest to comment out some portions of code to narrow down the problematic part.

The tracealyzer file does not indicate that unless there are ISRs that happen to not be monitored by tracealyzer…

Thank you for pointing that out! I did not examine the Tracealyzer file.