If setting portTICK_TYPE_IS_ATOMIC to 0 makes things worse, then there may be problems in your design with critical sections or interrupt control, as all that setting does is put accesses to the tick counter into a critical section.
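For context, this is roughly what that setting controls inside the kernel (paraphrased from the FreeRTOS portable.h and tasks.c; exact code differs between kernel versions):

/* Paraphrased from FreeRTOS portable.h: when the tick type cannot be read
 * atomically, tick-count accesses are wrapped in a full critical section. */
#if ( portTICK_TYPE_IS_ATOMIC == 0 )
    #define portTICK_TYPE_ENTER_CRITICAL()    portENTER_CRITICAL()
    #define portTICK_TYPE_EXIT_CRITICAL()     portEXIT_CRITICAL()
#endif

/* ...and tasks.c then uses those macros when reading the shared counter: */
TickType_t xTaskGetTickCount( void )
{
    TickType_t xTicks;

    portTICK_TYPE_ENTER_CRITICAL();
    {
        xTicks = xTickCount;    /* the shared tick counter */
    }
    portTICK_TYPE_EXIT_CRITICAL();

    return xTicks;
}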
Hi @richard-damon thanks!
Can you clarify a little what you have in mind when you write "problems in your design with critical sections or interrupt control/design"?
Do you mean priorities?
Or wrong usage of xxxFromISR() functions?
Or something else?
It could be almost any of those; it somewhat depends on what you mean by "crashes". If the system stops responding because it hits an assert, that is valuable information worth getting, as it may pinpoint the problem.
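As a sketch, one common way to make a hit assert visible is to define configASSERT in FreeRTOSConfig.h so the core halts where a debugger can inspect it (this is the stock example from the FreeRTOS documentation):

/* FreeRTOSConfig.h - halt on a failed assert so the debugger shows
 * exactly which configASSERT() fired. */
#define configASSERT( x )    if( ( x ) == 0 ) { taskDISABLE_INTERRUPTS(); for( ;; ) {} }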
Based on all the tests I've done and on my knowledge, it seems that no assert is triggered.
As @jbaum and @qtprashleigh found out in their posts above (better than me), it seems that the FreeRTOS scheduler stops switching tasks.
I can confirm that this happens only if there is FreeRTOS + rpmsg + an external interrupt (for them it's a GPT, for me it's an ADC data-ready interrupt). So something with the interrupts goes wrong when rpmsg and FreeRTOS are used together.
I'm hoping for the investigation from @MichalPrincNXP, but let me know if I can do anything to help.
One thing that just caught my eye:
in rpmsg_platform.c I see:
/**
* platform_in_isr
*
* Return whether CPU is processing IRQ
*
* @return True for IRQ, false otherwise.
*
*/
int32_t platform_in_isr(void)
{
return (((SCB->ICSR & SCB_ICSR_VECTACTIVE_Msk) != 0UL) ? 1 : 0);
}
and this is based on the ICSR register.
But as far as I know, to determine whether the CPU is running in an interrupt handler (Handler mode) or in Thread mode I should read IPSR (not ICSR).
And there is the function __get_IPSR() for this purpose.
Are you sure about ICSR?
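For reference, a minimal IPSR-based variant could look like this (a sketch using the CMSIS __get_IPSR() intrinsic; a nonzero IPSR means the core is in Handler mode):

/* Sketch of an IPSR-based check: IPSR holds the active exception number,
 * so a nonzero value means the CPU is executing an exception handler. */
int32_t platform_in_isr(void)
{
    return ((__get_IPSR() != 0UL) ? 1 : 0);
}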
Hi @escherstair, based on my research, using either IPSR or ICSR should be OK. For FreeRTOS projects, one can directly use the xPortIsInsideInterrupt() function, which reads the IPSR register. Anyway, could you try replacing the platform_in_isr() implementation on your side to see the potential effect? I have not succeeded in reproducing the issue on my side yet; I have the working project but no crash observed so far, working on it…
I did that yesterday. Nothing changed.
I can confirm what @qtprashleigh wrote: with configUSE_16_BIT_TICKS set to 1, the firmware doesn't "stop working" after 37 hours.
I'll go on with my investigation.
@escherstair thanks. To avoid any optimization issue I would also try using xPortIsInsideInterrupt() directly in the rpmsg_env_freertos.c code, replacing env_in_isr(). May I ask you to try that on your side?
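Something like this minimal sketch (assuming the Cortex-M FreeRTOS port, where xPortIsInsideInterrupt() itself reads IPSR; the exact env_in_isr() shape may differ between rpmsg-lite versions):

/* rpmsg_env_freertos.c - sketch: ask FreeRTOS directly whether we are
 * in interrupt context instead of going through platform_in_isr(). */
#include "FreeRTOS.h"
#include "task.h"

int32_t env_in_isr(void)
{
    return (xPortIsInsideInterrupt() != pdFALSE) ? 1 : 0;
}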
I can confirm what @qtprashleigh wrote: with configUSE_16_BIT_TICKS set to 1, the firmware doesn't "stop working" after 37 hours.
Do I understand it correctly that when configUSE_16_BIT_TICKS is set to 1 the issue is no longer observed?
Just done it.
It doesn't fix the issue, but it takes longer to happen. So it has an impact.
Yes. Correct.
I have had my board running continuously for over 12 days now since making this change, with no issues. So I can confirm that this workaround avoids the issue.
I did other tests. I need to double check (one more time) because I want to be 100% sure about this:
- I did the tests with configUSE_16_BIT_TICKS set to 0 (so a 32-bit tick count)
- I've been able to build an application that doesn't crash in 37 hours (I'm not saying it won't crash forever)
- I have another application that crashes in 37 hours (or sooner if I change configINITIAL_TICK_COUNT; see the sketch after this list)
- the difference between them is small, and the functions added (to get the crash) are never called during the execution
- so, it seems to me that only the map file (the placement of objects in memory) changes between the two applications
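The tick-related settings involved look like this (a sketch of my FreeRTOSConfig.h; the configINITIAL_TICK_COUNT value is only a hypothetical example to shift when the problem shows up):

/* FreeRTOSConfig.h - sketch of the settings under test.
 * 0xFFFF0000 is a hypothetical example value; changing it shifts
 * how soon the crash appears in my tests. */
#define configUSE_16_BIT_TICKS      0    /* 32-bit TickType_t */
#define configINITIAL_TICK_COUNT    ( ( TickType_t ) 0xFFFF0000UL )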
But give me some more time, because I want to be really 100% sure about what I wrote above. More tests are scheduled on my side.
If @MichalPrincNXP can share some updates from his side it would be useful.
Interesting find! How large/complicated is the project? Would it be possible to share it (or a derivation of it) as a minimal example to reproduce the issue? My project requires a proprietary kernel driver, which makes things harder for Michal.
@escherstair great news, I am interested in what the issue comes from.
As @jbaum indicated, I was not very successful in porting the provided minimalist app to other hardware and reproducing the issue. I got it working on an rt1180 board using the IAR compiler, and I am facing a task getting stuck in a blocked state after about 20 minutes, but I can't say whether this is a porting issue or the particular problem we are all trying to solve. Anyway, I observed that the task gets stuck only when the MU and GPT interrupts coexist and interact with the task (the GPT posts an event and the MU puts a new item into the rpmsg queue that the task is waiting for). Once I updated the app logic to avoid the GPT interrupts, no task got stuck. Also, no issue is observed when the rpmsg_queue_recv API is called with a 0 timeout.
@jbaum @MichalPrincNXP the project itself is not so simple, but the crash happens even if it basically does nothing (I mean, reading an external ADC and sending messages from the M7 to the A53). All the other tasks are not triggered.
The problem is that my application requires at least an external ADC that returns the ADC samples. Long story short: custom hardware is required.
I'll think about whether I can simplify the application in some way to avoid the need for the ADC.
Reading what you wrote, I think you got the point, because I can confirm that the issue happens only if the MU and an external interrupt coexist (in my case this is the ADC).
If I remove the ADC (physically or its app logic) no issue happens.
In my application I call rpmsg_queue_recv_nocopy() with a timeout that is not 0. I'll change the code to use 0. But I can say that when I used portMAX_DELAY the issue happened more often, so I decreased the value.
At that time I thought that if something (an interrupt?) happens while rpmsg is waiting, something else stays blocked.
This could explain why a 0 timeout doesn't show the issue (rpmsg doesn't wait, so nothing can happen while it waits).
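For clarity, the receive path I'm describing is shaped roughly like this (a simplified sketch; my_rpmsg and my_queue are placeholder handles, and the timeout is the value I've been tuning, with rpmsg-lite's RL_DONT_BLOCK (0) being the no-wait variant):

/* Sketch of the receive path (my_rpmsg and my_queue are placeholders).
 * Tried so far: portMAX_DELAY (issue most frequent), a finite timeout
 * (less frequent), and 0 / RL_DONT_BLOCK (the variant discussed above). */
uint32_t src;
char *payload = NULL;
uint32_t payload_len = 0U;

int32_t status = rpmsg_queue_recv_nocopy(my_rpmsg, my_queue, &src,
                                         &payload, &payload_len,
                                         100U /* timeout in ticks */);
if (status == RL_SUCCESS)
{
    /* ... process payload ... */
    (void)rpmsg_queue_nocopy_free(my_rpmsg, payload);
}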
Let's stay in touch on this topic. It seems to me we're not far from catching it.
I tested this workaround (with rpmsg_queue_recv_nocopy() in my case) and it doesn't work.
Calling with a 0 timeout doesn't fix the issue.
At the moment the only effective workaround is setting configUSE_16_BIT_TICKS to 1.
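That is, the only line that reliably changes the behaviour in my FreeRTOSConfig.h is:

/* FreeRTOSConfig.h - the only workaround found effective so far:
 * use a 16-bit TickType_t instead of the 32-bit default. */
#define configUSE_16_BIT_TICKS    1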
I'll let you know.
After deeper testing I can confirm that the difference between a firmware that crashes and one that doesn't crash is only in functions that are never called.
I'm starting an investigation into object alignment in the map file and/or firmware size to see what happens.