If MU1_M7_IRQHandler() is placed at 0x80010908, I don’t see the crash.
If the linker places it at 0x800109d8 (i.e. 208 bytes later), I see the crash, and the time until the crash depends on the value of configINITIAL_TICK_COUNT.
Have you checked the alignment requirements of the interrupt table on that processor? At a quick search, the generic requirement for the CM7 is 32 bytes (0x20), but your shift doesn’t meet that requirement.
The problem is that the interrupt vector relocation register doesn’t implement all bits, but leaves some lower ones forced to be zero.
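For reference, the alignment rule that ARM documents for ARMv7-M applies to the vector table base written to VTOR, not to the individual handlers: the base must be aligned to the table size rounded up to a power of two, with a minimum of 32 words (128 bytes). A minimal host-side sketch of that check follows. The function names and the interrupt count of 160 are assumptions for illustration, not values taken from the i.MX8MP reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Required VTOR alignment in bytes for a core with num_irqs external
 * interrupts: (16 system exceptions + num_irqs) words, rounded up to a
 * power of two, and never less than 128 bytes. */
uint32_t vtor_required_alignment(uint32_t num_irqs)
{
    uint32_t table_bytes = (16u + num_irqs) * 4u;
    uint32_t align = 128u;
    while (align < table_bytes) {
        align <<= 1;
    }
    return align;
}

/* True if base is a legal vector table address for that interrupt count. */
bool vtor_base_is_valid(uint32_t base, uint32_t num_irqs)
{
    return (base & (vtor_required_alignment(num_irqs) - 1u)) == 0u;
}

/* Self-check; 160 interrupts is a hypothetical count. */
void vtor_self_check(void)
{
    assert(vtor_required_alignment(160u) == 1024u); /* 176 words = 704 bytes */
    assert(vtor_base_is_valid(0x80000000u, 160u));
}
```

Under these assumptions a table at 0x80000000 passes the check, and a handler moving by 208 bytes does not affect it, because the rule constrains the table base only.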
Many addresses crash. It seems that 0x80010908 is a lucky one (it doesn’t crash). But I cannot say more, because this kind of test requires a lot of time (and few of us seem interested in it, even if the issue seems not to be in our customer code).
I don’t know how to do this.
The vector table itself is at 0x80000000. Is there any maximum distance between the vector table and the addresses of the individual interrupt handlers?
Or for the jump between the item in the vector table and the interrupt handler itself?
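On the distance question: in the ARMv7-M exception model, each vector table entry is a full 32-bit word holding the handler’s absolute address with bit 0 set (Thumb state), and the core loads the PC from that word on exception entry. No relative branch is involved, so no maximum distance applies. A small sketch of how an entry relates to a handler address; the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Value stored in the vector table for a handler at handler_addr.
 * Bit 0 is set to indicate Thumb state. */
uint32_t vector_entry_for(uint32_t handler_addr)
{
    return handler_addr | 1u;
}

/* Address the core actually starts executing from for a given entry. */
uint32_t handler_addr_from_entry(uint32_t entry)
{
    return entry & ~1u;
}

/* Self-check using the two handler addresses quoted in this thread. */
void vector_entry_self_check(void)
{
    assert(vector_entry_for(0x80010908u) == 0x80010909u);
    assert(handler_addr_from_entry(0x800109d9u) == 0x800109d8u);
}
```

Both quoted addresses are representable in the same way, so the table lookup itself gives no reason why one placement would work and the other would not.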
Maybe it’s not clear what I did: I was lucky to have two different applications. One of them crashes and the other one doesn’t. The difference in my source code is small, so I started comparing the two map files. The two addresses are the ones where the linker placed the handler automatically.
I played with the nop() only to force the linker to move all the other functions around.
Hm ok, I am not 100% sure if the DWT of the M7 is fully backward compatible with the M4, but on the M4 you would use the DWT and do something like this - this code fragment breaks into an attached debugger if 0x641a6350 is written into 0x20001f88:
*((volatile unsigned long *)0xe0001050) = 0x20001f88; // address to trigger on
*((volatile unsigned long *)0xe0001030) = 0x641a6350; // optional value to trigger on
*((volatile unsigned long *)0xe0001038) = 0x3b06; // configuration to determine what access triggers the break
You’d need to configure 0xe0001038 for fetch access (the exact setting of the bits can be found in the ARM reference manual).
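To make the magic number 0x3b06 less opaque, here is a sketch that builds it from the DWT_FUNCTIONn bit fields as described in the ARMv7-M architecture manual. The macro and function names are made up for this example; only the field positions come from the manual. In the fragment above, 0xe0001050 is DWT_COMP3 (the address), 0xe0001030 is DWT_COMP1 (the value) and 0xe0001038 is DWT_FUNCTION1.

```c
#include <assert.h>
#include <stdint.h>

/* DWT_FUNCTIONn fields (ARMv7-M); names are illustrative. */
#define DWT_FUNC_WATCH_ON_WRITE  0x6u                   /* FUNCTION[3:0] */
#define DWT_FUNC_DATAVMATCH      (1u << 8)              /* compare data value */
#define DWT_FUNC_LNK1ENA         (1u << 9)              /* read-only on hardware */
#define DWT_FUNC_DATAVSIZE(n)    ((uint32_t)(n) << 10)  /* 0=byte, 1=half, 2=word */
#define DWT_FUNC_DATAVADDR0(n)   ((uint32_t)(n) << 12)  /* linked address comparator */

/* Configuration word for "break when a given 32-bit value is written",
 * with the address held in comparator addr_comparator. */
uint32_t dwt_value_write_watch(unsigned addr_comparator)
{
    return DWT_FUNC_WATCH_ON_WRITE
         | DWT_FUNC_DATAVMATCH
         | DWT_FUNC_LNK1ENA
         | DWT_FUNC_DATAVSIZE(2u)
         | DWT_FUNC_DATAVADDR0(addr_comparator);
}

/* Self-check: comparator 3 holds the address, as in the fragment above. */
void dwt_self_check(void)
{
    assert(dwt_value_write_watch(3u) == 0x3b06u);
}
```

Note that the DWT only responds once TRCENA (bit 24 of DEMCR at 0xE000EDFC) is set. Whether the M7 implementation on this SoC needs anything beyond that is not something this sketch covers.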
@escherstair thank you for your continued effort on this. The results are interesting and, IMO, seem to point to some issue in the NXP code, though I have no ideas on how to debug it further (nor time, due to other priorities at the moment).
I do hope @MichalPrincNXP might be able to add something now that you’ve narrowed the issue down to a single function.
Hello, I am sorry for the late response … I had counted on my colleague’s help with this topic, but he is out of the office longer than I expected. Anyway, if I understand it correctly you are using the ddr target, i.e. the MU1_M7_IRQHandler is placed in ddr, correct? I suspect that this could be a problem; it would be worth trying a different memory to rule it out. I am going to involve the i.mx8mp sdk owners, who know this SoC and board better and could comment. Please be patient, I will get back to you soon, ok?
I know you have been pointing to this part already. I thought the issue could come from other parts of the app, or that it could be SoC-specific, as this kind of issue has not been reported to us yet (rpmsg-lite was introduced in 2016). I went through the rpmsg_env_freertos.c code again and focused on the queue handling functions, because you are using rpmsg_queue.c and rpmsg_queue_rx_cb(), which puts newly received rpmsg messages into the freertos queue. Then you are using the rpmsg_queue_recv() API with the timeout param outside the interrupt context to get newly received rpmsg messages from the freertos queue. (I hope that is how it is used.) As the issue symptoms are connected with ISR and timing, I would focus on the queue handling functions.
I am arranging resources internally to help reproduce and analyze the problem on the i.mx8 platform you are using, but may I ask you to try some rpmsg_env_freertos.c code adjustments on your side, please? Could you disable the portEND_SWITCHING_ISR calls in env_get_queue() and env_put_queue()? It should not cause any issue; higher-priority tasks woken by the ISR simply won’t run immediately, and the context switch will be delayed until the next tick interrupt. I would try it to rule out that the issue comes from this code.
If that change has no effect, I would focus on the use of xQueueReceive, especially the timeout_ms param, perhaps casting timeout_ms to (TickType_t). I also have doubts about the use of uintptr_t in the API functions, introduced for ARMv8 aarch64 compatibility.
And could you also recap the situation with compiler optimizations? Which optimization levels cause the issue?
To be precise, I use rpmsg_queue_recv_nocopy() in the way you described.
I disabled the portEND_SWITCHING_ISR calls in env_get_queue() and env_put_queue(), but nothing changes. The application crashes in the same way, at the same time.
With -Os the crash is systematic, after a time that depends on the value of configINITIAL_TICK_COUNT. With other compiler optimization levels the application crashes too, but after a different amount of time (and it seems to me that this time is not always the same, even with the same binary).
As I wrote above, I can cause or fix the crash simply by moving MU1_M7_IRQHandler around in the map file, leaving all the other symbols (functions and variables) in the same places.
I just wanted to add to this thread that I too am working on a project with the iMX8mp with FreeRTOS running on the M7 core, and I am having the exact same issue. After around 37 hours of run time, the M7 firmware stops responding to the Linux side over rpmsg. For the last several weeks I assumed it was an issue with my specific application code, but after poring over my code for weeks and now finding this thread, I believe it to be the same root cause (whatever that may be).
I have confirmed setting configUSE_16_BIT_TICKS to 1 allows the firmware to run past the ~37 hour mark. However, this is not a long term solution for my application so I will be following this thread. If I can do any testing or provide any more information that might help root cause this, please let me know.
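Since both reports tie the failure to the tick counter (configINITIAL_TICK_COUNT in one case, configUSE_16_BIT_TICKS in the other), it may help to translate uptimes into tick values. The sketch below assumes a tick rate of 1000 Hz, which neither post states, and uses hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_TICK_RATE_HZ 1000u /* assumption: configTICK_RATE_HZ is not given in the thread */

/* Ticks elapsed before a 32-bit tick counter that started at initial
 * reaches target, taking wrap-around into account. */
uint32_t ticks_until(uint32_t initial, uint32_t target)
{
    return (uint32_t)(target - initial);
}

/* Convert a number of ticks to hours at the assumed tick rate. */
double ticks_to_hours(uint32_t ticks)
{
    return (double)ticks / (double)ASSUMED_TICK_RATE_HZ / 3600.0;
}

/* Self-check: a counter starting at 0 needs 2^27 ticks to reach 0x08000000. */
void tick_self_check(void)
{
    assert(ticks_until(0u, 1u << 27) == 134217728u);
}
```

With that assumption, 37 hours is about 133.2 million ticks, which is close to 2^27 = 134,217,728 ticks (about 37.28 hours). This is only an observation on the numbers, not a diagnosis. Setting configINITIAL_TICK_COUNT so that a suspected tick value is reached within minutes would be one way to test it.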
Thanks for the clarifications and for testing with the portEND_SWITCHING_ISR calls disabled. While my colleagues are getting ready to reproduce this on the discussed imx8 board (in progress), may I ask you for another test on your side? This time I have focused on eliminating the use of uintptr_t in the freertos porting layer and prepared an update on the temp_freertos_hangs_solving branch. It removes the changes introduced with aarch64 support. Would it be possible to retest this code on your project, please?
I tested your patches. With them the application doesn’t crash (short-term test only, no long-term test at the moment).
But there is a big but:
with this patch, only one function changes: .text.rpmsg_lite_remote_init, which was 0x150 bytes with uintptr_t and is 0x154 bytes (so, bigger) with uint32_t.
The reason for this change in the generated assembly is not clear to me, but I played a little bit and it seems to me that it depends on how the macros RL_WORD_ALIGN_UP(a) and RL_WORD_ALIGN_DOWN(a) are implemented in the two cases.
But I don’t see why this should happen.
One thing that I notice is in the macro RL_WORD_ALIGN_UP(a), which I think should be
/*! @brief Align a value up to the next multiple of the word size */
#define RL_WORD_ALIGN_UP(a) (((((uint32_t)(a)) & (RL_WORD_SIZE - 1U)) != 0U) ? ((((uint32_t)(a)) & (~(RL_WORD_SIZE - 1U))) + RL_WORD_SIZE) : ((uint32_t)(a)))
Notice that I replaced the hardcoded 4U with RL_WORD_SIZE.
But this doesn’t change the behavior.
The real reason why it doesn’t crash is that since .text.rpmsg_lite_remote_init increased its size, all the following functions from rpmsg_lite module (plus some other functions, until the first *fill* is appended) moved by 4 bytes in the map file.
If I take my old firmware that crashes and add some nops so that rpmsg_lite_remote_init grows by 4 bytes, the firmware doesn’t crash anymore.
So, this is the big but:
the firmware doesn’t crash not because of the patch, but because the rpmsg_lite functions moved in the map file.
I attach two files, crash.txt and no_crash.txt, which are the generated assembly of the rpmsg_lite module with the original code (crash.txt) and with your patch (no_crash.txt).
Thanks. Do I understand correctly that only the disassembly of the rpmsg_lite_remote_init function changed? Is the disassembly of the other rpmsg-lite functions unchanged when using my testing github repo branch with the uintptr_t replacement?
rpmsg_lite_remote_init is the only function that changes its size (even if I don’t understand why your patch should produce this).
If you look at the assembly files that I uploaded inside the .zip in my previous message, you see that some other functions change some fixed numbers loaded into registers with movw or movs operations. I suspect they could be the addresses of some objects (changed because the size of rpmsg_lite_remote_init has changed, and so the memory addresses of some objects have changed).
I think that we need the help of an assembly expert to understand why the function changes in this way.
As far as I understand, on Cortex-M7 uint32_t and uintptr_t are the same data type (32 bits wide). And so I would expect that the macro RL_WORD_ALIGN_UP() should not change its generated assembly.
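One way to check the arithmetic part of that expectation on a host machine is to compare the two variants directly. The sketch below re-implements the align-up logic as two functions with made-up names and a local WORD_SIZE of 4 (an assumption; it does not use the rpmsg-lite headers). It shows that the results agree, but it says nothing about what code a particular compiler emits for each cast.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_SIZE 4U /* assumption: 32-bit word, as on the Cortex-M7 */

/* Align-up using uint32_t arithmetic. */
uint32_t align_up_u32(uint32_t a)
{
    return ((a & (WORD_SIZE - 1U)) != 0U) ? ((a & ~(WORD_SIZE - 1U)) + WORD_SIZE) : a;
}

/* Same logic using uintptr_t; the mask is widened so that this also
 * behaves correctly on a 64-bit host. */
uintptr_t align_up_uptr(uintptr_t a)
{
    const uintptr_t mask = (uintptr_t)WORD_SIZE - 1U;
    return ((a & mask) != 0U) ? ((a & ~mask) + WORD_SIZE) : a;
}

/* Self-check: both variants agree on a few sample values. */
void align_self_check(void)
{
    assert(align_up_u32(5U) == 8U);
    assert(align_up_uptr(5U) == 8U);
    assert(align_up_u32(8U) == (uint32_t)align_up_uptr(8U));
}
```

If the results are identical and the generated code still differs in size, the difference more likely comes from type promotion in the surrounding expressions (for example an operand of type size_t coming from a sizeof-based RL_WORD_SIZE) than from the alignment logic itself. That is a guess, and comparing the two disassemblies of rpmsg_lite_remote_init would be the way to confirm it.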
Just as another data point, I replaced the rpmsg lib source in my project with the one in your branch with the uintptr_t change, and the M7 still stopped responding at the ~37 hour mark. So no change in my application.