If setting portTICK_TYPE_IS_ATOMIC to 0 makes things worse, then there may be problems in your design with critical sections or interrupt control/design, as all that setting does is put accesses to the tick counter into a critical section.
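For context, a simplified sketch of what that option controls (paraphrasing the logic in FreeRTOS.h; exact macro names may differ between FreeRTOS versions):

#if ( portTICK_TYPE_IS_ATOMIC == 0 )
    /* The tick count cannot be read atomically, so reads from task
     * context are wrapped in a critical section. */
    #define portTICK_TYPE_ENTER_CRITICAL()    taskENTER_CRITICAL()
    #define portTICK_TYPE_EXIT_CRITICAL()     taskEXIT_CRITICAL()
#else
    /* The tick count can be read atomically, so no locking is needed. */
    #define portTICK_TYPE_ENTER_CRITICAL()
    #define portTICK_TYPE_EXIT_CRITICAL()
#endif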
Hi @richard-damon thanks!
Can you clarify a little what you have in mind when you write "problems in your design with critical sections or interrupt control/design"?
Do you mean priorities?
Or wrong usage of xxxFromISR() functions?
Or something else?
It could be almost any of those; it somewhat depends on what you mean by "crashes". If the system stops responding because it hit an assert, that is valuable information worth getting, as it may pinpoint the problem.
Based on all the tests I've done and on my knowledge, it seems that no assert is triggered.
As @jbaum and @qtprashleigh found out in their posts above (better than I did), it seems that the FreeRTOS scheduler stops switching tasks.
I can confirm that this happens only if there is FreeRTOS + RPMsg + an external interrupt (for them it's a GPT, for me it's an ADC data-ready interrupt). So something with the interrupts goes wrong when RPMsg and FreeRTOS are used together.
I'm hoping for results from @MichalPrincNXP's investigation, but let me know if I can do anything to help.
One thing that just caught my eye:
In rpmsg_platform.c I see:
/**
* platform_in_isr
*
* Return whether CPU is processing IRQ
*
* @return True for IRQ, false otherwise.
*
*/
int32_t platform_in_isr(void)
{
return (((SCB->ICSR & SCB_ICSR_VECTACTIVE_Msk) != 0UL) ? 1 : 0);
}
and this is based on the ICSR register.
But as far as I know, to determine whether code is running in Handler mode (i.e., an interrupt handler) or Thread mode, I should read IPSR (not ICSR).
And there is the function __get_IPSR() for this purpose.
Are you sure about ICSR?
Hi @escherstair, based on my research, using either IPSR or ICSR should be OK. For FreeRTOS projects, one can directly use the xPortIsInsideInterrupt() function, which reads the IPSR register. Anyway, could you try replacing the platform_in_isr() implementation on your side to see the potential effect? I have not succeeded in reproducing the issue on my side yet; I have the working project but no crash observed so far, working on it …
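For reference, an IPSR-based replacement might look like this (a sketch assuming a CMSIS environment where __get_IPSR() is available; IPSR holds the active exception number and is non-zero only in Handler mode):

int32_t platform_in_isr(void)
{
    /* IPSR == 0 means Thread mode; any non-zero value is an active exception. */
    return ((__get_IPSR() != 0UL) ? 1 : 0);
}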
I did it yesterday. Nothing changes.
I can confirm what @qtprashleigh wrote: with configUSE_16_BIT_TICKS set to 1, the system doesn't "stop working" after 37 hours.
I'll go on with my investigation.
@escherstair thanks. To avoid any optimization issue I would also try to use xPortIsInsideInterrupt() directly in the rpmsg_env_freertos.c code, replacing env_in_isr(). May I ask you to try that on your side?
I can confirm what @qtprashleigh wrote: with configUSE_16_BIT_TICKS set to 1, the system doesn't "stop working" after 37 hours.
Do I understand correctly that when configUSE_16_BIT_TICKS is set to 1 the issue is no longer observed?
Just done it.
It doesn't fix the issue, but it takes longer for the issue to happen. So it has an impact.
Yes. Correct.
I have had my board running continuously for over 12 days now since making this change, with no issues. So I can confirm that this workaround avoids the issue.
I did other tests. I need to double check (one more time) because I want to be 100% sure about this:
- I tested with configUSE_16_BIT_TICKS set to 0 (so a 32-bit tick count)
- I've been able to build an application that doesn't crash within 37 hours (I'm not saying it won't crash ever)
- I have another application that crashes within 37 hours (or sooner if I change configINITIAL_TICK_COUNT; see the sketch after this list)
- the difference between them is small, and the functions added (to get the crash) are never called during execution
- so, it seems to me that only the map file (the placement of objects in memory) changes between the two applications
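A sketch of the configuration I'm referring to (the specific initial value is hypothetical, just to illustrate how configINITIAL_TICK_COUNT can move the tick count close to a rollover and make any tick-related bug show up sooner):

/* FreeRTOSConfig.h excerpt */
#define configUSE_16_BIT_TICKS      0    /* 32-bit TickType_t */

/* Start the tick count near the 32-bit wrap-around instead of at 0, so
 * overflow-related misbehavior appears within minutes rather than days. */
#define configINITIAL_TICK_COUNT    ( ( TickType_t ) 0xFFFF0000UL )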
But give me some more time, because I want to be really 100% sure about what I wrote above. More tests are scheduled on my side.
It would be useful if @MichalPrincNXP could share some updates from his side.
Interesting find! How large/complicated is the project? Would it be possible to share it (or a derivation of it) as a minimal example to reproduce the issue? My project requires a proprietary kernel driver, which makes things harder for Michal.
@escherstair great news, I am interested in finding out where the issue comes from.
As @jbaum indicated, I was not very successful in porting the provided minimal app to other hardware and reproducing the issue. I got it working on an RT1180 board using the IAR compiler, and I am seeing a task stuck in the Blocked state after circa 20 minutes, but I can't say whether this is a porting issue or the particular problem we are all trying to solve. Anyway, I observed that the task gets stuck only when the MU and GPT interrupts coexist and interact with the task (the GPT posts an event and the MU puts a new item into the RPMsg queue that the task is waiting for). Once I updated the app logic to avoid the GPT interrupts, no stuck task was observed. Also, no issue was observed when the rpmsg_queue_recv API is called with a 0 timeout.
@jbaum @MichalPrincNXP the project itself is not so simple, but the crash happens even if it basically does nothing (I mean, reading an external ADC and sending messages from the M7 to the A53). All the other tasks are never triggered.
The problem is that my application requires at least an external ADC that must return the ADC samples. Long story short: custom hardware is required.
I'll think about whether I can simplify the application in some way to remove the need for the ADC.
Reading what you wrote, I think you got the point, because I can confirm that the issue happens only if the MU and an external interrupt coexist (in my case the ADC).
If I remove the ADC (physically, or just its application logic), no issue happens.
In my application I call rpmsg_queue_recv_nocopy() with a timeout that is not 0. I'll change it to use 0. But I can say that when I used portMAX_DELAY the issue happened more often, so I decreased the value.
At that time I thought that if something (an interrupt?) happens while RPMsg is waiting, something else stays blocked.
This could explain why a 0 timeout doesn't show the issue (RPMsg doesn't wait, so nothing can happen while waiting).
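A sketch of the two receive patterns being compared (the instance/queue/buffer variable names are placeholders; the argument order follows the rpmsg-lite rpmsg_queue_recv_nocopy() API, but check your version's header):

/* Pattern that showed the issue more often: block "forever" on the queue. */
result = rpmsg_queue_recv_nocopy(rpmsg_inst, rpmsg_q, &remote_addr,
                                 (char **)&rx_data, &rx_len, portMAX_DELAY);

/* Workaround under test: poll with a 0 timeout, so the task never sits
 * blocked on the RPMsg queue while another interrupt fires. */
result = rpmsg_queue_recv_nocopy(rpmsg_inst, rpmsg_q, &remote_addr,
                                 (char **)&rx_data, &rx_len, 0);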
Let's stay in touch on this topic. It seems to me we're not so far from the catch.
I tested this workaround (with rpmsg_queue_recv_nocopy() in my case) and it doesn't work.
Calling with a 0 timeout doesn't fix the issue.
At the moment the only effective workaround is setting configUSE_16_BIT_TICKS to 1.
I'll let you know.
After deeper testing I can confirm that the difference between a firmware that crashes and one that doesn't crash lies only in functions that are never called.
I'm starting an investigation into object alignment in the map file and/or firmware size to see what happens.
Hello @escherstair, I am curious whether you have any new findings? From your latest posts it seems the issue comes rather from GCC object alignment, so I have given up my effort for now. Thanks.
Hello @MichalPrincNXP
my big effort on this investigation goes on.
Unfortunately, alignment alone is not the factor, since I have two apps with the same alignment (I mean, the last byte of every object, function and data, is the same in the two apps).
But one of them crashes and the other one doesn't.
In the past I had exactly the same symptoms on another platform, where the issue was a silicon errata related to the delay cycles needed for the instruction cache mechanism to work properly.
In that case, when functions changed their position in flash, sometimes a cache miss was triggered and, because of the silicon errata, the cache invalidation was not handled properly.
I would really appreciate it if you could go on with your investigation.
Hello @escherstair, since cache has been mentioned, have you tried disabling it for all used memories on both the master and remote sides, plus the shared memory, to eliminate this as a possible root cause (I guess you did, just confirming)?
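On the Cortex-M side, such an experiment might look like this (a sketch assuming the CMSIS core functions are available on the M7; the Linux side and the shared-memory region would need their own handling):

/* Early in main(), before the scheduler starts: rule out cache effects. */
SCB_DisableICache();   /* CMSIS: disables and invalidates the I-cache */
SCB_DisableDCache();   /* CMSIS: disables, cleans and invalidates the D-cache */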
As for the investigation on my side, it does not make much sense to continue with my RT1180/IAR setup; I would rather switch to an i.MX8 board with Linux, but my colleague who could help with that is out of the office these days. I am not able to promise any effort on that this week, unfortunately.
Regards
Michal
No, I didn't do this, because Linux is out of my control and I don't know if (and how) I can do it there.
I'm not saying it's the cache. I said that in another situation with similar symptoms it was a silicon errata related to cache handling.
And in that case the workaround came from the silicon vendor.