When I wrote “crash” I meant “the M7 core doesn’t behave as expected” (no more answers to rpmsg, no more FreeRTOS task switching).
I don’t know exactly what happened inside the core (I’ve never been able to take a trace as JBaum did).
But I’m going to play with configINITIAL_TICK_COUNT. If this lets me trigger the failure (I won’t call it a crash anymore) as soon as I like, we’ll be sure that the issue is the same (I think we can agree on this).
I do see similarities between @escherstair’s problem and mine, especially after reading his description on the rpmsg-lite issue tracker. Our setups are quite similar: the same SoC, FreeRTOS, rpmsg-lite, one other ISR, one task (?), the problem occurs after many hours, there is no indicator of what happened, …
From our side, I must report having switched to ThreadX, with which the issue does not seem to occur (multiple devices running for nearly a week now). I guess that the issue somehow stems from the combination of FreeRTOS and rpmsg-lite, not necessarily one or the other. Unfortunately, time does not permit further root cause analysis at this point, and we have to cover the uncertainty with more extensive testing instead.
It’s really interesting that your investigation seems to indicate that the -Os compiler flag is necessary to see the issue (-O2 seems fine).
When this kind of situation occurs, it usually means one of the following:
some wrong assumption about data alignment (which may not be respected with -Os)
code defined as inline is really inlined (i.e., copied and pasted where it’s called)
I tried to compare map files with different optimization flags.
This is the list of FreeRTOS and rpmsg functions that are inlined with -Os and are normal functions with -O2:
prvIsQueueEmpty
prvCopyDataFromQueue
prvBytesInBuffer
prvInsertTimerInActiveList
prvReloadTimer
prvSampleTimeNow
rpmsg_lite_get_endpoint_from_addr
vq_ring_update_used
Do you see any issue if one of the above FreeRTOS functions is inlined (i.e., not compiled as a normal function but copied and pasted where it’s called)?
If switching to ThreadX is OK (extensive testing needed), I think that the issue comes from either FreeRTOS or the rpmsg_freertos code.
Since creating the initial post, we have found -O2 to also hang, although the connection with the tick count being close to 2^27 is now gone and reproducibility is worse. Optimization flag -O1 is still “fine”, but it takes just one counterexample, as it did with -O2.
Hello all, I am interested in this issue as it touches rpmsg-lite, which I am responsible for. @jbaum Would it be possible to share a minimal project that suffers from the described issue? I have limited possibilities for testing on an i.MX8 board, but maybe I could port the failing project to another multicore MCU to reproduce and solve the problem. Thank you.
Hi Michal! Thank you for your offer. I will try to prepare such an example for you. Due to time constraints I would have to strip our project to its rpmsg related parts, rather than try and build one from the ground up. Meaning, there are relevant portions of our code that I can not reasonably strip further. I am only allowed to share that code privately with you however, so if you’d send me your e-mail address via PM, I can start.
Hi @jbaum, great, you can reach me via email; see my email in the MichalPrincNXP profile on GitHub (unfortunately I am not allowed to put a link into this post).
I just came across this thread while debugging exactly the same issue. I’m running FreeRTOS on the i.MX8MP CM7 with rpmsg-lite, and when the systick value reaches 0x0800 0000 the rpmsg task hangs. In my case, this causes a watchdog reset 8 seconds later.
Interestingly, there is one task in my application which has a higher priority than the rpmsg task and this one keeps running. Using that task, I can see that the rpmsg task and all lower-priority tasks are unresponsive, but the systick value is still incrementing.
I don’t have anything yet to add as far as root cause, but given the commonality between my issue and the one first reported by @jbaum I would assume there is some bug in the rpmsg-lite implementation. If I get any other clues I will share them here.
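To give a sense of scale for the 0x0800 0000 tick value, here is a minimal sketch of how long the counter takes to get there and which starting value would bring the failure forward via configINITIAL_TICK_COUNT. The 1000 Hz tick rate and the helper names are assumptions for illustration only; they are not taken from any of the projects discussed here.

```c
#include <stdint.h>

#define SUSPECT_TICK  0x08000000u  /* 2^27, the tick value reported in this thread */
#define TICK_RATE_HZ  1000u        /* assumption: 1 ms tick, adjust to your config */

/* Whole seconds from a tick count of zero until SUSPECT_TICK is reached. */
static uint32_t seconds_until_suspect_tick(void)
{
    return SUSPECT_TICK / TICK_RATE_HZ;
}

/* Starting tick count that reaches SUSPECT_TICK after lead_seconds.
 * The result is a candidate value for configINITIAL_TICK_COUNT.
 * Only meaningful while lead_seconds * TICK_RATE_HZ <= SUSPECT_TICK. */
static uint32_t initial_tick_for_lead(uint32_t lead_seconds)
{
    return SUSPECT_TICK - lead_seconds * TICK_RATE_HZ;
}
```

With these assumptions the counter needs roughly 37 hours to reach the suspect value from zero, which fits the “many hours” reported above; a starting value from initial_tick_for_lead(60) would move that point to about one minute after the scheduler starts.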
@MichalPrincNXP I put together a minimal example and sent it to your work mail. It still involves some heavier messaging logic that I could not remove however.
@qtprashleigh Could you share some details on the overall architecture of your project? As in, what tasks, interrupts, and control and message flow are involved? Maybe there is some setup that provokes this issue. Maybe some things can be ruled out.
One thought about issues that happen at a tick value of 0x8000’0000: that is where int values roll over and the unsignedness of tick values becomes important.
An inspection of the code to see that all the handling of tick values is done with correct arithmetic should be done.
Tick Values must always be stored as unsigned values, of the type TickType_t.
You cannot use the normal comparison operators on tick values unless you explicitly handle the “epoch” change of rolling over from 0xFFFF’FFFF to 0x0000’0000, and you MUST NOT let a tick value of 0x8000’0000 or later be treated as “negative” through an int promotion. (This is why there are two delay lists in the scheduler, one for this epoch and one for the next, so that it can use comparisons to determine whether something has passed.)
The “best” and most reliable method of testing for timeouts without special handling for epochs (it assumes that a whole epoch of time doesn’t elapse, though the epoch may change) is to compute “now” - starting_time >= timeout_interval. Since the starting time is always in the past relative to now, the difference will always be positive and thus, as long as we can’t have lost an epoch, gives the right duration. Computing starting_time + timeout_interval and checking whether we have passed that point will have issues, and this case is one of them.
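The two approaches can be compared in a small standalone sketch. The tick_t typedef is a stand-in for a 32-bit TickType_t and the function names are made up for illustration; none of this is code from FreeRTOS or rpmsg-lite.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for TickType_t with a 32-bit tick (assumption for this sketch). */
typedef uint32_t tick_t;

/* Wrap-safe check: the unsigned subtraction is done modulo 2^32, so the
 * elapsed time is correct even if 'now' has rolled over past 'start'.
 * The cast back to tick_t matters when the tick type is narrower than int,
 * because the operands would then be promoted to int before subtracting. */
static bool timeout_elapsed(tick_t now, tick_t start, tick_t interval)
{
    return (tick_t)(now - start) >= interval;
}

/* Fragile check: the deadline itself can wrap around, after which a plain
 * comparison against 'now' reports a timeout far too early. */
static bool timeout_elapsed_fragile(tick_t now, tick_t start, tick_t interval)
{
    tick_t deadline = (tick_t)(start + interval);
    return now >= deadline;
}
```

For example, with start = 0xFFFFFFF0 and an interval of 0x20 ticks, only 8 ticks have passed at now = 0xFFFFFFF8. The first function correctly reports no timeout, while the second one compares against a wrapped deadline of 0x10 and reports that the timeout has already elapsed.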
The main task has priority 3 and handles the bulk of the work using GPIO, ECSPI, and I2C to interface to various external hardware. This task uses GPT6 for critical timing interrupts.
The rpmsg task has priority 2. Its only job is to wait for an rpmsg request from Linux and then respond with a buffer containing some operational status data which it gets from the main task. It services about three such requests per second.
The watchdog task has priority 4. It services the hardware watchdog every 5 seconds. Each of the other tasks must periodically refresh a health status bit, or else this task will allow the watchdog to expire.
I have found that when the tick timer exceeds 0x08000000, the rpmsg task and the main task both stop refreshing their health status, but the watchdog task keeps running. I also discovered that if I change the timeout in rpmsg_queue_recv() from 2000ms to 0, the issue didn’t occur as reliably (though it still happened sometimes).
As a workaround, I found that if I set configUSE_16_BIT_TICKS to 1 the issue appears to be resolved. My application doesn’t need to handle delays over 5 seconds, so a 16-bit tick works fine for me. I’m no closer to root cause, but at least I have a workaround for now. I will let it run for a few days to confirm that it works.
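As a rough sanity check of the 16-bit tick workaround, the sketch below tests whether a delay still fits into a 16-bit tick count. The 1000 Hz tick rate and the helper names are assumptions; the conversion only imitates what pdMS_TO_TICKS() does and is not the FreeRTOS macro itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_RATE_HZ     1000u    /* assumption: 1 ms tick */
#define MAX_DELAY_16BIT  0xFFFFu  /* portMAX_DELAY when the tick type is 16 bits wide */

/* Convert milliseconds to ticks, in the spirit of pdMS_TO_TICKS(). */
static uint64_t ms_to_ticks(uint32_t ms)
{
    return ((uint64_t)ms * TICK_RATE_HZ) / 1000u;
}

/* True if the delay can be expressed as a finite 16-bit tick count,
 * i.e. it stays below the value reserved for portMAX_DELAY. */
static bool delay_fits_16bit_tick(uint32_t ms)
{
    return ms_to_ticks(ms) < MAX_DELAY_16BIT;
}
```

Under these assumptions both the 5 second watchdog period and the 2000 ms rpmsg_queue_recv() timeout fit comfortably, while anything around 65.5 seconds or longer would not.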
@richard-damon Thank you for your suggestions, but I’m not convinced this has anything to do with signedness. The troublesome value is 0x0800 0000, not 0x8000 0000. Apologies for missing the leading zero in my initial post.
But if I want to “promote” a TickType_t (which can be 16, 32 or 64 bits wide) to a uint64_t time, I should left-shift the value so that it’s stored in the most significant bits of the uint64_t.
Otherwise, after (TickType_t)0xFFFF FFFF there is (TickType_t)0x0000 0000, but after (uint64_t)0x0000 0000 FFFF FFFF there is (uint64_t)0x0000 0001 0000 0000.
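The effect of widening before subtracting can be shown with a short sketch. It assumes a 32-bit tick type, and the function names are hypothetical; the third function illustrates the left-shift idea described above.

```c
#include <stdint.h>

/* Stand-in for a 32-bit TickType_t (assumption for this sketch). */
typedef uint32_t tick_t;

/* Difference taken in the tick type: wraps modulo 2^32 like the counter. */
static uint64_t elapsed_narrow(tick_t now, tick_t start)
{
    return (tick_t)(now - start);
}

/* Difference taken after zero-extending to 64 bits: it no longer wraps
 * together with the counter, so a rollover shows up as a huge value. */
static uint64_t elapsed_widened(tick_t now, tick_t start)
{
    return (uint64_t)now - (uint64_t)start;
}

/* Left-shift variant: the tick occupies the top 32 bits, so the 64-bit
 * subtraction wraps at the same point as the 32-bit counter does. */
static uint64_t elapsed_shifted(tick_t now, tick_t start)
{
    return (((uint64_t)now << 32) - ((uint64_t)start << 32)) >> 32;
}
```

With start = 0xFFFFFFFB and now = 0x00000005, ten ticks have elapsed. The narrow and shifted versions both return 10, while the plainly widened version returns 0xFFFFFFFF0000000A. This only matters when differences or comparisons are computed on the widened value; a widened tick that is merely read and reported is unaffected.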
As for calling xEventGroupClearBits() in env_init(): yes, that is intentional. The xEventGroupClearBits() implementation in event_groups.c checks that “the user is not attempting to clear the bits used by the kernel itself”. From the eventEVENT_BITS_CONTROL_BYTES macro we can see that the highest byte of EventBits_t (TickType_t) is used by the kernel and can’t be used for event bits.
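The check described above can be pictured with a small standalone sketch. The mask is copied here as what I believe is the value of eventEVENT_BITS_CONTROL_BYTES for a 32-bit tick type, so please verify it against your own copy of event_groups.c; the typedef and function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror eventEVENT_BITS_CONTROL_BYTES for a 32-bit tick type;
 * copied locally so that this sketch compiles on its own. */
#define CONTROL_BYTES_32BIT  0xff000000UL

/* Stand-in for EventBits_t when TickType_t is 32 bits wide (assumption). */
typedef uint32_t event_bits_t;

/* True if none of the requested bits fall into the byte reserved for the
 * kernel, which leaves 24 usable event bits. */
static bool bits_are_user_bits(event_bits_t bits)
{
    return (bits & CONTROL_BYTES_32BIT) == 0u;
}
```

Under this assumption, a mask such as 0x00FFFFFF covers only usable event bits, whereas 0x01000000 or 0xFFFFFFFF touches the reserved byte and would trip the kernel’s check.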
As for env_get_timestamp, this obsolete function is not used in rpmsg_lite and we should think about removing it from the rpmsg_lite code. The original intention was to use it for debugging purposes, to get a 64-bit hardware timer value. It was then moved into the env layer and implemented as a read of the OS tick timer. I think the implementation with casting uint32_t to uint64_t is OK. I have also checked the app that @jbaum provided: this function is not used in the app code and should not be the root cause.
I am working on getting the project provided by @jbaum running on another board that I have, in order to replicate the issue…
Starting from what @qtprashleigh wrote, I investigated what changes when configUSE_16_BIT_TICKS is set to 1.
Something changes, but it’s not so interesting for the issue IMHO.
But then I see in portmacro.h that
#elif ( configTICK_TYPE_WIDTH_IN_BITS == TICK_TYPE_WIDTH_32_BITS )
typedef uint32_t TickType_t;
#define portMAX_DELAY ( TickType_t ) 0xffffffffUL
/* 32-bit tick type on a 32-bit architecture, so reads of the tick count do
* not need to be guarded with a critical section. */
#define portTICK_TYPE_IS_ATOMIC 1
Is it possible that something is not handled properly when portTICK_TYPE_IS_ATOMIC is set to 1?
It’s true that the M7 in the iMX8M-Plus is a 32-bit architecture, but the Cortex-A is 64-bit, so I don’t know what is atomic and what is not.
I’m going to force this to 0 (with configUSE_16_BIT_TICKS set to 0 too, so a 32-bit tick) to see what happens to the issue.
One key point is that a Cortex-A processor needs a different port than the Cortex-M7, so it will have a different portmacro.h with different definitions.
Sorry for the confusion, but my application is running on Cortex-M7 (not on Cortex-A). I pointed out the Cortex-A because I’m not sure if a “classic” standalone M7 is different from the HMP.
Arm works hard to make anything marked with a specific architecture follow the architecture rules, so I would be fairly sure that within the M7, it works atomically.
Now, if you share that with another processor, you will need to follow the cache-coherency rules defined by the combination, which is outside of FreeRTOS’s domain except for the SMP ports, but as they are different processors, it can’t be SMP.