FreeRTOS hangs - iMX8 CM7 RPMSG

escherstair · November 18, 2025, 8:52am

@MichalPrincNXP , @jbaum, @qtprashleigh I have final results of my test.

The size of the built application is not a factor.

There is only one function in the built application that cannot be moved.

If I move it I get the crash. I added some nop() where necessary to keep all the other functions at the same location while investigating.

This function is MU1_M7_IRQHandler()

I’m quite sure that what matters is its position relative to something else (but I don’t know what this “something else” is).

The vector table is stored at

.interrupts 0x80000000 0x2a8
0x80000000 __VECTOR_TABLE = .
0x80000000 __Vectors = .
0x80000000 . = ALIGN (0x4)

If MU1_M7_IRQHandler() is placed at 0x80010908 I don’t see the crash.

If the linker places it at 0x800109d8 (i.e. 208 bytes later) I see the crash. And the time until the crash depends on the value of configINITIAL_TICK_COUNT.

Does this give anyone an idea?

I don’t know what to do now…

RAc · November 18, 2025, 1:33pm

Fascinating… what if you move it to some other address? Is 0x800109d8 the only crashing address, or is there a range or a bit pattern that suffices?

Can you put a hardware bp on fetch on the address to try to figure out whether anything else but the interrupt attempts to fetch it?

richard-damon · November 18, 2025, 2:19pm

Have you checked the alignment requirements of the interrupt table on that processor? At a quick search, the generic requirement for the CM7 is 32 bytes (0x20), but your shift doesn’t meet that requirement.

The problem is that the interrupt vector relocation register doesn’t implement all bits, but leaves some lower ones forced to be zero.
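
The alignment rule mentioned here can be checked with a little arithmetic. This is only a minimal sketch, assuming the usual ARMv7-M rule (the table base is aligned to the number of entries rounded up to a power of two, times 4 bytes, with a 128-byte minimum); the helper names are hypothetical. For the 0x2a8-byte .interrupts section from the first post (170 entries) that would mean 1024-byte alignment, which a base of 0x80000000 satisfies. Note that the rule applies to the table base, not to the addresses of the handlers it points to.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (valid for 1 <= n <= 2^31). */
static uint32_t next_pow2(uint32_t n)
{
    uint32_t p = 1U;
    assert(n >= 1U && n <= 0x80000000U);
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Required vector table alignment in bytes.
 * ASSUMPTION: ARMv7-M style rule, 4 bytes per entry, 128-byte minimum. */
static uint32_t vtor_required_alignment(uint32_t table_size_bytes)
{
    uint32_t entries = table_size_bytes / 4U;
    uint32_t align = next_pow2(entries) * 4U;
    return (align < 128U) ? 128U : align;
}

/* Returns 1 if base satisfies the alignment computed above, else 0. */
static int vtor_base_is_aligned(uint32_t base, uint32_t table_size_bytes)
{
    return (base & (vtor_required_alignment(table_size_bytes) - 1U)) == 0U;
}
```

With these helpers, vtor_base_is_aligned(0x80000000U, 0x2a8U) is true, while a base shifted by 0xD0 would not be.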

escherstair · November 18, 2025, 2:30pm

A lot of addresses crash. It seems that 0x80010908 is a lucky one (and it doesn’t crash). But I cannot say more because this kind of test requires a lot of time (and few of us seem interested in it - even if the issue seems not to be in our customer code).

I don’t know how to do this.

The vector table itself is at 0x80000000. Is there any maximum distance between the vector table and the addresses of the individual interrupt handlers?

Or for the jump from an entry in the vector table to the interrupt handler itself?

Maybe it’s not clear what I did: I was lucky and had two different applications. One of them crashes and the other one doesn’t. The difference in my source code is small, so I started comparing the two map files. The two addresses are the ones where the linker placed the handler automatically.

I played with the nop() only to force the linker to move all the other functions around.

Months of work on my side (not weeks or days).

RAc · November 18, 2025, 2:52pm

Hm ok, I am not 100% sure if the DWT of the M7 is fully backward compatible with the M4, but on the M4 you would use the DWT and do something like this - this code fragment breaks into an attached debugger if 0x641a6350 is written into 0x20001f88:

*((volatile unsigned long *)0xe0001050) = 0x20001f88; // DWT_COMP3: address to trigger on
*((volatile unsigned long *)0xe0001030) = 0x641a6350; // DWT_COMP1: optional value to trigger on
*((volatile unsigned long *)0xe0001038) = 0x3b06; // DWT_FUNCTION1: configuration to determine what access triggers the break

You’d need to configure 0xe0001038 for fetch access (the exact setting of the bits can be found in the ARM reference manual).
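
A sketch of what a fetch-triggered variant might look like. It assumes the ARMv7-M DWT layout (comparator n at 0xE0001020 + 16*n with COMP, MASK and FUNCTION registers, and FUNCTION value 0x4 for an instruction-address match), that TRCENA in DEMCR has to be set first, and that the M7 in question implements the DWT at all, which is not confirmed here. The struct and function names are hypothetical; the registers are passed in as pointers so the logic can be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* One DWT comparator block (assumed ARMv7-M layout). */
typedef struct {
    volatile uint32_t comp;      /* address to compare against */
    volatile uint32_t mask;      /* number of low address bits to ignore */
    volatile uint32_t function;  /* what kind of access triggers the event */
    volatile uint32_t reserved;
} dwt_comparator_t;

#define DWT_COMPARATOR_BASE 0xE0001020UL /* DWT_COMP0 */
#define DEMCR_ADDR          0xE000EDFCUL /* Debug Exception and Monitor Control */
#define DEMCR_TRCENA        (1UL << 24)  /* enables the DWT unit */
#define DWT_FUNC_PC_MATCH   0x4UL        /* assumed: instruction address match */

/* Address of comparator block n, e.g. n = 1 gives 0xE0001030. */
static uint32_t dwt_comparator_address(unsigned n)
{
    assert(n < 16U);
    return (uint32_t)(DWT_COMPARATOR_BASE + 16UL * n);
}

/* Program one comparator to raise a debug event when handler_addr is fetched. */
static void dwt_watch_fetch(dwt_comparator_t *cmp, volatile uint32_t *demcr,
                            uint32_t handler_addr)
{
    *demcr |= DEMCR_TRCENA;                 /* turn the DWT on */
    cmp->function = 0U;                     /* disable while reprogramming */
    cmp->comp = handler_addr & 0xFFFFFFFEU; /* drop the Thumb bit */
    cmp->mask = 0U;                         /* exact address match */
    cmp->function = DWT_FUNC_PC_MATCH;
}
```

On the target the call would be something like dwt_watch_fetch((dwt_comparator_t *)0xE0001020, (volatile uint32_t *)0xE000EDFC, 0x800109d8), with a debugger attached so the event actually halts the core.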

escherstair · November 18, 2025, 3:14pm

Hi @RAc

are you sure that the M7 of the NXP i.MX 8M Plus has a DWT?

RAc · November 18, 2025, 3:17pm

Honestly, I don’t know, I’d have to check with the data sheet, but I’m fairly busy right now, apologies for the noise if it doesn’t.

qtprashleigh · November 18, 2025, 4:14pm

@escherstair thank you for your continued effort on this. The results are interesting, and IMO seems to point to some issue in the NXP code, though I’ve no ideas on how to debug it further (nor time, due to other priorities at the moment).

I do hope @MichalPrincNXP might be able to add something now that you’ve narrowed the issue down to a single function.

escherstair · November 24, 2025, 3:03pm

Hi @MichalPrincNXP can you share some thoughts on this, please?

MichalPrincNXP · November 25, 2025, 1:55pm

Hello, I am sorry for the late response … I had been counting on the help of a colleague on this topic, but he is out of the office longer than I expected. Anyway, if I understand correctly you are using the DDR target, i.e. MU1_M7_IRQHandler is placed in DDR, correct? I suspect that this could be a problem; it would be worth trying a different memory to rule it out. I am going to involve the i.MX8MP SDK owners, who know this SoC and board better and could comment. Please be patient, I will get back to you soon, ok?

escherstair · November 26, 2025, 6:56am

Yes, but if I’m right @jbaum has all the code in the ITCM and only most of the data in DDR.

So, don’t focus too much on the DDR.

Based on what we (customers) found at the moment, only two different scenarios can fix the issue:

either

setting configUSE_16_BIT_TICKS to 1

or

using rpmsg over ThreadX (and not freertos)

Based on these two things I think that either rpmsg porting layer or freertos itself (or both of them) are responsible for the issue.

I cannot speak for other customers, but for me it is essential that you (and NXP) put significant resources into this.

I can wait, but I need to be sure that you won’t forget about this.

MichalPrincNXP · November 28, 2025, 10:14am

I know you have been pointing to this part already. I thought the issue could come from other parts of the application or could be SoC-specific, as this kind of issue has not been reported to us before (rpmsg-lite was introduced in 2016). I went through the rpmsg_env_freertos.c code again and focused on the queue handling functions, because you are using rpmsg_queue.c and rpmsg_queue_rx_cb(), which puts newly received rpmsg messages into the FreeRTOS queue. Then you are using the rpmsg_queue_recv() API with a timeout param outside the interrupt context to get newly received rpmsg messages from the FreeRTOS queue. (I hope that is how it is used.) As the symptoms are connected with ISRs and timing, I would focus on the queue handling functions.

I am arranging resources internally to help with reproducing and analyzing the problem on the i.MX8 platform you are using, but may I ask you to try some rpmsg_env_freertos.c code adjustments on your side, please? Could you disable the portEND_SWITCHING_ISR calls in env_get_queue() and env_put_queue()? It should not cause any issue; higher priority tasks that were woken by the ISR just won’t run immediately, and the context switch will be delayed until the next tick interrupt. I would try it to rule out that the issue comes from this code.
If that change has no effect, I would focus on the xQueueReceive use, especially on the timeout_ms param. Maybe try casting timeout_ms to (TickType_t). I also have doubts about the uintptr_t use in the API functions, introduced for ARMv8 AArch64 compatibility.

And could you also recap the situation with compiler optimizations? Which optimization levels cause the issue?

Thank you.

escherstair · November 28, 2025, 2:00pm

Hi @MichalPrincNXP

thank you for your feedback.

To be precise I use rpmsg_queue_recv_nocopy() in the way you described.

I disabled the portEND_SWITCHING_ISR calls in env_get_queue() and env_put_queue() but nothing changes. The application crashes in the same way, at the same time.

With -Os the crash is systematic, after a time that depends on the value of configINITIAL_TICK_COUNT. With other compiler optimization levels the application crashes too, but after a different amount of time (and it seems to me that this time is not always the same, even with the same binary).

As I wrote above, I can trigger or fix the crash simply by moving MU1_M7_IRQHandler around in the map file, leaving all the other symbols (functions and variables) in the same places.

jhan · November 28, 2025, 3:02pm

I just wanted to add to this thread that I too am working on a project with the iMX8mp with FreeRTOS running on the M7 core and am having the exact same issue. After around 37 hours of run time, the M7 firmware stops responding to the Linux side over rpmsg. For the last several weeks I assumed it was an issue with my specific application code, but after poring over my code for weeks and now finding this thread, I believe it to be the same root cause (whatever that may be).

I have confirmed setting configUSE_16_BIT_TICKS to 1 allows the firmware to run past the ~37 hour mark. However, this is not a long term solution for my application so I will be following this thread. If I can do any testing or provide any more information that might help root cause this, please let me know.
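
For anyone trying to relate the ~37 hour figure to tick counts, here is a minimal arithmetic sketch. It assumes a 1 kHz tick (configTICK_RATE_HZ set to 1000), which nobody has stated in this thread, and all helper names are hypothetical. Under that assumption 37 hours is about 133.2 million ticks, far short of the roughly 49.7 days a 32-bit tick counter needs to wrap, while a 16-bit counter wraps about every 65 seconds.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 1 kHz tick; adjust to the real configTICK_RATE_HZ. */
#define ASSUMED_TICK_RATE_HZ 1000ULL

/* Convert a run time in seconds to a number of ticks. */
static uint64_t seconds_to_ticks(uint64_t seconds)
{
    assert(seconds <= UINT64_MAX / ASSUMED_TICK_RATE_HZ);
    return seconds * ASSUMED_TICK_RATE_HZ;
}

/* Convert a number of ticks to whole seconds (rounded down). */
static uint64_t ticks_to_seconds(uint64_t ticks)
{
    return ticks / ASSUMED_TICK_RATE_HZ;
}

/* Whole seconds after boot until a 32-bit tick counter that starts at
 * `initial` reaches `target`. Unsigned subtraction wraps modulo 2^32,
 * which mirrors how the counter itself wraps. */
static uint64_t seconds_until_tick(uint32_t initial, uint32_t target)
{
    uint32_t delta = target - initial;
    return ticks_to_seconds(delta);
}
```

The last helper shows why configINITIAL_TICK_COUNT is handy for experiments: starting the counter at 0xFFFFFC18 would put the 32-bit wrap one second after boot under the same assumption. For what it is worth, 2^27 ticks at 1 kHz is about 37.3 hours; whether that is related to the hang is not established here.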

MichalPrincNXP · December 1, 2025, 8:59pm

Thanks for the clarifications and for testing with the portEND_SWITCHING_ISR calls disabled. While my colleagues are getting ready to reproduce the issue on the discussed i.MX8 board (in progress), may I ask you for another test on your side? This time I focused on eliminating the uintptr_t use in the FreeRTOS porting layer and prepared an update on the temp_freertos_hangs_solving branch. It removes the changes introduced with AArch64 support. Would it be possible to retest this code in your project, please?

Thank you.

MichalPrincNXP · December 1, 2025, 9:01pm

Thanks for reporting and confirming a similar issue to the one discussed here. Hopefully we will find the root cause soon.

escherstair · December 2, 2025, 2:31pm

Hi @MichalPrincNXP

I tested your patches. With them the application doesn’t crash (short-term test only so far, no long-term test yet).

But there is a big “but”:

with this patch, only one function changes: .text.rpmsg_lite_remote_init, which was 0x150 bytes with uintptr_t and is 0x154 bytes (so, bigger) with uint32_t.

The reason for this change in the generated assembly is not clear to me, but I played a little bit and it seems to me that it depends on how the macros RL_WORD_ALIGN_UP(a) and RL_WORD_ALIGN_DOWN(a) are implemented in the two cases.

But I don’t see why this should happen.

One thing that I notice is in the macro RL_WORD_ALIGN_UP(a) that I think should be

/*! @brief Align a value up to the next multiple of the word size */
#define RL_WORD_ALIGN_UP(a) \
    (((((uint32_t)(a)) & (RL_WORD_SIZE - 1U)) != 0U) ? ((((uint32_t)(a)) & (~(RL_WORD_SIZE - 1U))) + RL_WORD_SIZE) : \
     ((uint32_t)(a)))

notice that I replaced the hardcoded 4U with RL_WORD_SIZE.

But this doesn’t change the behavior.
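
A small host-side check can confirm that the uint32_t and uintptr_t forms of the align-up expression compute the same values, so any size difference would have to come from code generation rather than from the results. This is a sketch only: it assumes RL_WORD_SIZE is 4, and the macro and function names below are hypothetical stand-ins, not the library's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: word size of 4 bytes, as on a 32-bit Cortex-M target. */
#define RL_WORD_SIZE 4U

/* Align-up written with uint32_t casts. */
#define ALIGN_UP_U32(a)                                        \
    (((((uint32_t)(a)) & (RL_WORD_SIZE - 1U)) != 0U)           \
         ? ((((uint32_t)(a)) & (~(RL_WORD_SIZE - 1U))) + RL_WORD_SIZE) \
         : ((uint32_t)(a)))

/* The same expression written with uintptr_t casts. */
#define ALIGN_UP_UPTR(a)                                       \
    (((((uintptr_t)(a)) & (RL_WORD_SIZE - 1U)) != 0U)          \
         ? ((((uintptr_t)(a)) & (~(RL_WORD_SIZE - 1U))) + RL_WORD_SIZE) \
         : ((uintptr_t)(a)))

/* Returns 1 if both variants give the same result for value.
 * Values above 0xFFFFFFFC are excluded: there the 32-bit form wraps to 0,
 * which a 64-bit host uintptr_t would not reproduce. */
static int align_up_variants_agree(uint32_t value)
{
    assert(value <= 0xFFFFFFFCU);
    return (uintptr_t)ALIGN_UP_U32(value) == ALIGN_UP_UPTR(value);
}
```

For example, both variants turn 0x1001 into 0x1004 and leave 0x1004 unchanged.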

The real reason why it doesn’t crash is that since .text.rpmsg_lite_remote_init increased its size, all the following functions from rpmsg_lite module (plus some other functions, until the first *fill* is appended) moved by 4 bytes in the map file.

If I take my old firmware that crashes and add some nop so that rpmsg_lite_remote_init grows by 4 bytes, the firmware doesn’t crash anymore.

So, this is the big “but”:

the firmware doesn’t crash not because of the patch, but because the rpmsg_lite functions moved in the map file.

I attach two files, crash.txt and no_crash.txt, containing the generated assembly of the rpmsg_lite module without your patch (crash.txt) and with your patch (no_crash.txt).

assembly.zip (15.9 KB)

Let me know what you need me to test.

MichalPrincNXP · December 2, 2025, 9:38pm

Thanks. Do I understand correctly that only the disassembly of the rpmsg_lite_remote_init function changed? Is the disassembly of the other rpmsg-lite functions unchanged when using my testing GitHub repo branch with the uintptr_t replacement?

escherstair · December 3, 2025, 7:14am

Not exactly.

rpmsg_lite_remote_init is the only function that changes its size (even if I don’t understand why your patch should produce this).

If you look at the assembly files that I uploaded inside the .zip in my previous message, you will see that some other functions change some fixed numbers loaded into registers with movw or movs instructions. I suspect these could be the addresses of some objects (changed because the size of rpmsg_lite_remote_init has changed, and so the memory addresses of some objects have changed).

I think that we need the help of an assembly expert to understand why the function changes in this way.

As far as I understand, on Cortex-M7 uint32_t and uintptr_t are the same data type (32 bits wide). And so I would expect that the macro RL_WORD_ALIGN_UP() should not change its generated assembly.

jhan · December 4, 2025, 1:35pm

Just as another data point, I replaced the rpmsg lib source in my project with the one in your branch with the uintptr_t change, and the M7 still stopped responding at the ~37 hour mark. So no change in my application.
