Thanks for sharing your observations, @escherstair, @jhan. So, you confirmed that the uintptr_t use does not cause the issue, that’s good. Let me focus on the provided disassembly comparison …
@jhan what is the memory address of MU1_M7_IRQHandler in your application with and without the uintptr_t patch?
What is the assembly of rpmsg_lite_remote_init in the two cases?
Since in your case you don’t see differences, this means that:
- either the assembly doesn’t change. But in this case, why?
- or the change of assembly doesn’t matter (but I would expect that in this case the address of MU1_M7_IRQHandler doesn’t change).
And so, I think that we have another signal that, for some reason, MU1_M7_IRQHandler cannot be moved…
Seems I can’t upload attachments, so I can’t post the disassembly. The assembly of the rpmsg_lite_remote_init function changed after adding the patched version; however, the address of MU1_M7_IRQHandler did not change.
You should now be able to upload attachments.
Do you have any news on this?
After what @jhan wrote, I would not focus too much on the assembly comparison.
The trick seems to be moving MU1_M7_IRQHandler around, and this behavior makes me wonder about wrong or incomplete cache handling when entering and leaving interrupt handlers and the FreeRTOS scheduler.
@escherstair could you double check that all your ISRs end with a dsb instruction (__DSB())?
It is present in the overloaded MU1_M7_IRQHandler implementation in rpmsg-lite\lib\rpmsg_lite\porting\platform\imx8mp_m7\rpmsg_platform.c. Please also double check that #if (defined __CORTEX_M) && ((__CORTEX_M == 4U) || (__CORTEX_M == 7U)) is correctly evaluated and that the dsb instruction is really a part of MU1_M7_IRQHandler.
As the following post describes, missing that barrier could cause an app that “crashed once every few days for unknown reasons”, so the symptoms are similar here.
Please, also check all your other ISRs that are part of your app. Thanks.
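As a rough illustration of the check requested above, the sketch below wraps the barrier in a macro behind the same preprocessor guard and lets a target build fail if the guard evaluates to false. The names ISR_EXIT_BARRIER, ISR_BARRIER_ENABLED, EXPECT_CORTEX_M_BARRIER, Demo_IRQHandler and demo_irq_count are hypothetical and are not taken from the SDK or rpmsg-lite; on the target, __CORTEX_M and __DSB() are assumed to come from the CMSIS core header.

```c
#include <stdint.h>

/* Same guard as quoted above. Off-target (no CMSIS header) the barrier
 * compiles to nothing, so the sketch also builds on a host machine. */
#if (defined __CORTEX_M) && ((__CORTEX_M == 4U) || (__CORTEX_M == 7U))
#define ISR_EXIT_BARRIER()  __DSB()
#define ISR_BARRIER_ENABLED 1
#else
#define ISR_EXIT_BARRIER()  ((void)0)
#define ISR_BARRIER_ENABLED 0
#endif

/* Define EXPECT_CORTEX_M_BARRIER (hypothetical) in the target build only:
 * a mis-evaluated guard then stops the build instead of silently
 * dropping the barrier. */
#if defined(EXPECT_CORTEX_M_BARRIER) && !ISR_BARRIER_ENABLED
#error "ISR exit barrier compiled out: is the CMSIS core header included first?"
#endif

static volatile uint32_t demo_irq_count; /* placeholder application state */

/* Hypothetical handler: do the work, then issue the barrier as the
 * last statement before returning from the exception. */
void Demo_IRQHandler(void)
{
    demo_irq_count++;
    ISR_EXIT_BARRIER();
}
```

Inspecting the disassembly of the handler remains the definitive check that the dsb instruction is actually emitted.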
@MichalPrincNXP already checked multiple times.
Nothing new.
__DSB() is there and the macro is correctly evaluated (for all the interrupts for which I wrote the IRQ handlers myself).
For all the other ones used in the SDK, it’s up to NXP to double check….
Sorry, but this is the fact.
Thanks for the confirmation, it does not make sense to check ISRs that are not used in an app.
I will point out that just because the vendor should have done it right, doesn’t mean that when you use their code you don’t need to check it too. Your customer won’t accept that excuse when YOUR product doesn’t work.
Sorry for the misunderstanding in my answer (English is not my native language).
I agree 100% with you and I can assure you that over the last year I’ve spent a lot of time analyzing the source code from NXP and FreeRTOS, trying to find the reason for the crash.
All the IRQ handlers seem OK to me (even the ones I took as they are from the NXP SDK).
What frustrates me (and I think the other guys too) is that when I started reporting the issue, the answer from NXP and FreeRTOS has been “it’s not our code. You must be wrong somewhere”, without any deeper investigation.
But now, in this thread, I see a real effort on the topic and I really appreciate this.
@escherstair , you have mentioned another “silicon errata related to delay cycles needed to have the instruction cache mechanism work properly” … what SoC is that about? The symptoms are very similar. I went through the IMX8MP_1P33A errata and do not see any cache-related item. I am not familiar with that domain but it could be the issue.
Another point that comes to my mind and that could be worth trying is to replace rpmsg_queue_recv_nocopy() with rpmsg_queue_recv(), which should not be a very complicated change, and we could eliminate any data coherency issues when accessing shared memory from the application.
In the other case, the SoC is PIC32MZ1024EFK100 from Microchip (it’s not a Cortex-M).
The silicon errata is the number 38 in this document.
In that case the crash happened while responding to ethernet packets (pings too).
I tried everything (RAM testing, flash testing, corruption testing, temperature testing, …) with no help. Then, after I demonstrated that changing the position of some ethernet-related functions created or fixed the issue, I concentrated my investigation on caching.
And I found the silicon errata.
With the suggested workaround I was able to fix the issue, even when the functions are placed at the “offending” addresses.
I’m not saying this is the same case. I only share my past experience.
Do you mean accessing from the Cortex-A and the Cortex-M, or accessing from different tasks on the Cortex-M?
I meant to make that change on the M7 side: instead of calling rpmsg_queue_recv_nocopy() from a task and then releasing the rpmsg buffer by calling rpmsg_queue_nocopy_free() later, allocate an application buffer that is filled with the received payload during the rpmsg_queue_recv() function execution. There is no need to release the rpmsg buffer by calling rpmsg_queue_nocopy_free() anymore because this is done internally in the rpmsg_queue_recv() implementation. It adds another data copy (from shared memory to an application buffer), but the rpmsg buffer management is more accurate and we could avoid potential issues with incorrect buffer releasing at the application level.
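The difference in buffer ownership between the two receive styles can be modelled on a host machine as follows. This is only a simplified sketch: struct model_buf and the model_* functions are hypothetical stand-ins, not the rpmsg-lite API, and the real functions take additional arguments (instance, queue handle, source address, timeout).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_BUF_SIZE 16u

/* One "shared memory" buffer plus a flag recording whether the
 * application still owns it. */
struct model_buf {
    char     data[MODEL_BUF_SIZE];
    uint32_t len;
    int      held_by_app;
};

/* Zero-copy style: the caller gets a pointer into shared memory and
 * becomes responsible for releasing the buffer later. */
static char *model_recv_nocopy(struct model_buf *b, uint32_t *len)
{
    b->held_by_app = 1;
    *len = b->len;
    return b->data;
}

/* Explicit release that must follow every model_recv_nocopy(). */
static void model_nocopy_free(struct model_buf *b)
{
    b->held_by_app = 0;
}

/* Copying style: the payload is copied into an application buffer and
 * the shared buffer is released before returning.
 * Returns 0 on success, -1 if the destination is too small. */
static int model_recv(struct model_buf *b, char *dst, uint32_t maxlen,
                      uint32_t *len)
{
    if (b->len > maxlen) {
        return -1;
    }
    memcpy(dst, b->data, b->len);
    *len = b->len;
    b->held_by_app = 0;
    return 0;
}
```

With the copying style the application never holds a pointer into shared memory after the call returns, which is the property the suggested experiment relies on.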
@MichalPrincNXP using rpmsg_queue_recv() instead of rpmsg_queue_recv_nocopy() (and adding some nop to keep the addresses in the map file) doesn’t fix the issue.
So, the addresses in the map file play a role in this issue (I think this is quite confirmed), and another thing (much stranger) is that the value of configINITIAL_TICK_COUNT plays a role too (making the crash happen either soon or after 37 hours).
Hello @escherstair, thanks for trying rpmsg_queue_recv() instead of rpmsg_queue_recv_nocopy() and confirming that the issue does not come from this part of the code.
Recently, I have run across this conversation: RT1170: Add 32-byte alignment attribute to SystemInit by andrewleech · Pull Request #1 · nxp-mcuxpresso/mcux-devices-rt
There are some similarities (CM7 core, cache enabled, larger app) and maybe it could be worth trying to apply the proposed change, i.e. the SystemInit alignment in your project to see the possible effect. May I ask you to do so, please? Thank you.
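A minimal sketch of what such an alignment attribute looks like, assuming a GCC- or Clang-compatible toolchain. demo_system_init is a hypothetical stand-in for SystemInit, and starts_on_32_byte_boundary is an illustrative helper that is not part of the SDK or of the linked pull request.

```c
#include <stdint.h>

/* Ask the toolchain to place the function on a 32-byte boundary. */
__attribute__((aligned(32))) void demo_system_init(void)
{
    /* clock, cache and MPU setup would go here */
}

/* Returns 1 if the function starts on a 32-byte boundary.
 * Bit 0 is masked off because Thumb function pointers on Cortex-M
 * carry the Thumb bit there; on a host build the mask is harmless. */
static int starts_on_32_byte_boundary(void (*fn)(void))
{
    uintptr_t addr = (uintptr_t)fn & ~(uintptr_t)1u;
    return (addr % 32u) == 0u;
}
```

The map file remains the authoritative place to confirm the final placement, since the attribute also shifts the functions that follow.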
Thank you for your investigation.
I read the topic very carefully and I think that the symptoms seem very similar to what I see on iMX8MP. I added what I see (in parentheses):
- large firmware (in my case it’s 3MB)
- The failure pattern was deterministic: specific firmware builds would fail to boot (to run) 100% of the time, while making seemingly unrelated code changes would produce builds that worked 100% reliably. Subsequent changes could reintroduce failures in new builds.
- Investigation through binary search and build comparison revealed the issue correlates with SystemInit (rpmsg functions) function placement.
- The same binary always fails or always works—this is not an intermittent runtime issue but a build-layout-dependent failure.
But I don’t think that SystemInit() plays a role in this issue for iMX8MP, since I don’t have any boot issue (and SystemInit() is called at boot).
I think there is something related to the cache and MU1_M7_IRQHandler() or some other rpmsg function (maybe). Or the distance in the flash between MU1_M7_IRQHandler() and some other instruction. But which instruction? Something related to FreeRTOS, since configINITIAL_TICK_COUNT is involved too?
I did a quick test and added the alignment for SystemInit(), but this shifts a lot of functions in the map file. So I would need another firmware build to isolate the alignment from what we want to test. And this requires time.
I’ll do it only if it’s absolutely necessary. Let me know, @MichalPrincNXP
I think that the cache issue (for RT1170 and, probably, iMX8MP) should be heavily investigated by NXP hardware engineers.
Looking at Errata 1259864 here, I wonder if this could be the case when it happens on the memory where FreeRTOS places the xTickCount variable, if it’s 32 bits wide. And if an old/wrong value is used, several FreeRTOS time-related functions don’t work as expected (handling wrong time values).
Looking to my map files, the following functions are 32-byte aligned when the firmware works and they’re not when the firmware crashes:
- env_wait_for_link_up
- env_tx_callback
- env_wmb
- env_sleep_msec
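A comparison like this can be automated with a small helper that reports where each address from the map file falls inside its 32-byte block. The sketch below is illustrative only; the function names are hypothetical, and any addresses passed in must come from your own map files.

```c
#include <stdint.h>

/* Offset of an address inside its 32-byte block; 0 means the symbol
 * starts exactly on a 32-byte boundary. */
static uint32_t offset_in_32_byte_block(uint32_t addr)
{
    return addr & 31u;
}

/* Convenience wrapper: 1 if aligned to 32 bytes, 0 otherwise. */
static int is_32_byte_aligned(uint32_t addr)
{
    return offset_in_32_byte_block(addr) == 0u;
}
```

Running it over the same symbols in a working and a crashing build makes it easy to see which functions changed their position relative to a 32-byte boundary.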
This is probably not relevant, but as you’ve mentioned Cortex M7 and cache, I thought I’d mention it.
Our Cortex M7 application uses the MPU to flag some areas of memory that are used for DMA buffers as non-cacheable. The Cortex M7 does not permit unaligned accesses to non-cacheable memory. Because of this we had to (a) compile our program with the gcc -mno-unaligned-access option, and (b) replace the newlib version of memcpy with our own, because the version of newlib distributed for Cortex M7 isn’t compiled with that option and the code generated by memcpy performs unaligned accesses. Out of caution we used our own versions of memset and memcmp too.
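The replacement routines themselves are not shown in this thread. As a rough idea of what an alignment-safe copy can look like, here is a minimal byte-wise sketch; bytewise_copy is a hypothetical name, and the volatile qualifiers are an assumption made here to keep the compiler from merging the byte accesses into wider ones.

```c
#include <stddef.h>

/* Copies n bytes one at a time, so every access is naturally aligned.
 * Slower than an optimized memcpy, but never issues an unaligned
 * word access. Returns dst, like memcpy. */
void *bytewise_copy(void *dst, const void *src, size_t n)
{
    volatile unsigned char *d = dst;
    const volatile unsigned char *s = src;

    while (n-- > 0u) {
        *d++ = *s++;
    }
    return dst;
}
```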
If you use DMA to write into cacheable areas of memory, there are some things to watch for:
(a) Older versions of the Cortex M7 core have a bug which causes using write-back caching to give rise to data corruption in some situations. That’s why we only DMA to/from non-cacheable memory in our app.
(b) You must make sure that areas of memory that you DMA into do not share cache lines with unrelated areas that the processor writes to; otherwise a cache line that the processor has written into may be written back to memory, overwriting the data that the DMA controller wrote to.
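Point (b) is usually handled by aligning each DMA buffer to the cache line size and padding its length to a whole number of lines. A minimal C11 sketch, assuming the 32-byte line size of the Cortex-M7 L1 cache; dma_rx_buffer, DMA_PAYLOAD_BYTES and the macros are hypothetical names used for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u

/* Round a byte count up to a whole number of cache lines. */
#define ROUND_UP_TO_LINE(n) \
    ((((n) + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES) * CACHE_LINE_BYTES)

#define DMA_PAYLOAD_BYTES 100u /* example payload size */

/* Starts on a line boundary and occupies whole lines only, so no
 * unrelated variable can end up in the same cache line. */
static _Alignas(32) uint8_t dma_rx_buffer[ROUND_UP_TO_LINE(DMA_PAYLOAD_BYTES)];
```

Here the 100-byte payload is padded to 128 bytes, i.e. four full lines.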
Thanks @dc42
The code that I wrote doesn’t use DMA.
All the DDR memory is configured as write-back caching.
Do you think it is possible that the M7 core of iMX8MP is one of the “older versions of the Cortex M7 core” and so has the bug with write-back caching?
I just checked, and I was wrong about the bug. It applies to the use of write-through caching, not write-back caching. It’s documented at https://documentation-service.arm.com/static/665dff778ad83c4754308908?token=.