The FreeRTOS tick counter (xTickCount) stops incrementing in very rare cases

shinya · January 15, 2025, 2:13pm

I am running FreeRTOS in an AMP configuration on each of the two RPUs (Cortex-R5) of a Zynq UltraScale+ MPSoC.
On rare occasions, the tick counter (xTickCount) of one of the two RPUs stops incrementing.
What could be the cause?
The FreeRTOS version is 10.0.0, as provided by AMD Xilinx.
I have already confirmed the following.

The tick counter stops incrementing, but tasks keep running and context switches work properly.
The “FreeRTOS_Tick_Handler” function itself has not been corrupted: no write instruction is executed in the memory area where the interrupt handler “FreeRTOS_Tick_Handler” is stored.
XScuGic_ConfigTable.HandlerTable, where the function pointer to “FreeRTOS_Tick_Handler” is stored, has not been corrupted either.

richard-damon · January 15, 2025, 3:59pm

Are the tick interrupts still occurring? (My guess is they are not). Check with a breakpoint.

My first guess is that something has disabled the interrupt for the tick.

kstribrn · January 15, 2025, 4:28pm

I suspect your tick interrupt is not firing. If this is the case then you’ll still see a task running and event-based context switches will still work (example - unblocking a task through giving a semaphore, posting to a queue, etc). Time-based context switches (example: a task completing a delay) will not work.

As richard-damon has suggested you can check if the tick increment function is being called by a breakpoint. I’d also check your CPU to see what the interrupt configuration is when this starts occurring.

shinya · January 16, 2025, 12:28am

Thank you for your reply, richard-damon.
For debugging, I temporarily changed the tick interrupt source from the Triple Timer Counter to the FPGA, added an interrupt count register in the FPGA, and checked it.
The FPGA interrupt count register was incremented, but xTickCount was not. Therefore, it seems that “FreeRTOS_Tick_Handler” was not called even though the tick interrupt itself was received.
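The comparison described above (interrupts keep arriving, but the kernel tick does not advance) could be automated along these lines. This is only a minimal sketch: `tick_sample_t` and `tick_has_stalled` are hypothetical names, and it assumes the interrupt count comes from an independent free-running counter while the tick value is read with `xTaskGetTickCount()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of counters, sampled at the same moment. */
typedef struct {
    uint32_t irq_count;   /* independent count of tick interrupts raised (assumed: FPGA register) */
    uint32_t tick_count;  /* kernel tick, e.g. the value returned by xTaskGetTickCount() */
} tick_sample_t;

/*
 * Returns true when at least min_irqs interrupts were raised between the two
 * samples but the kernel tick did not advance at all. Unsigned subtraction
 * keeps both deltas correct across 32-bit wrap-around.
 */
bool tick_has_stalled(tick_sample_t prev, tick_sample_t now, uint32_t min_irqs)
{
    uint32_t irq_delta  = now.irq_count  - prev.irq_count;
    uint32_t tick_delta = now.tick_count - prev.tick_count;

    return (irq_delta >= min_irqs) && (tick_delta == 0u);
}
```

Because a stalled tick also stalls `vTaskDelay`, a check like this would have to run from something that is not time-driven, such as the idle hook or the handler of another interrupt.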

I have a limited number of JTAG debuggers, and this phenomenon occurs infrequently, and sometimes it is not reproduced even if I prepare 20 operating environments and run them continuously for 3 days…

What could be disabling the tick interrupt?
The CPU interrupt settings have not been changed since they were set via the driver at FreeRTOS startup. I have also confirmed that interrupts other than tick are operating normally.

shinya · January 16, 2025, 12:45am

Thank you for your reply, kstribrn.

As I told richard-damon, “FreeRTOS_Tick_Handler” does not seem to be called.
You are correct, and I first noticed this phenomenon when a task did not return from vTaskDelay.

The CPU interrupt configuration should not have been changed after FreeRTOS was started, but what specific configurations should be checked after the phenomenon occurs?

aggarg · January 16, 2025, 5:30am

Can you elaborate on this, please? Are you saying that the tick interrupt fires but the corresponding handler is not called?

shinya · January 16, 2025, 5:49am

Thank you for your reply, aggarg.
As you guessed, the tick interrupt signal is coming from the FPGA to the CPU, but the tick interrupt handler is not being called.
Other interrupt handlers seem to be called fine, so as richard-damon says, it’s possible that only the tick interrupt is being disabled by something.

aggarg · January 16, 2025, 6:11am

Is it possible to stop the system in this state and then use the debugger to examine if the interrupts are disabled? This would confirm the hypothesis and then you’d be able to focus on finding where the interrupt is getting disabled.

shinya · January 16, 2025, 9:25am

Thank you for your reply, aggarg.

There is a limited number of JTAG debuggers, and I do not know whether this phenomenon can be reproduced in an operating environment with a debugger connected, so for the time being I will try to incorporate debug code to check the state of the interrupt mask register when the phenomenon occurs.
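As a rough sketch of what such debug code might evaluate: on an ARM core with a GIC, an IRQ is blocked when the I bit (bit 7) of the CPSR is set, and also when its priority value is not numerically lower than the priority mask in the GIC CPU interface. The function below, `irq_can_reach_core`, is a hypothetical helper that works on values already captured by the debug code; how they are captured is not shown, and these are only two of the conditions that must hold.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPSR_I_BIT (1u << 7)  /* ARM CPSR: IRQs are masked when this bit is set */

/*
 * cpsr     - captured CPSR of the core
 * pmr      - captured GIC CPU interface priority mask
 * priority - captured priority field of the tick interrupt
 *
 * In the GIC a numerically lower value means a higher priority, and an
 * interrupt is signalled to the core only if its priority value is lower
 * than the priority mask.
 */
bool irq_can_reach_core(uint32_t cpsr, uint8_t pmr, uint8_t priority)
{
    bool cpu_masked = (cpsr & CPSR_I_BIT) != 0u;
    bool gic_masked = priority >= pmr;

    return !cpu_masked && !gic_masked;
}
```

Logging the raw values alongside the result would make it possible to tell afterwards which of the two masks was responsible.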

richard-damon · January 16, 2025, 2:37pm

I am not familiar with the R5 processor, but often the “Interrupt Controller” isn’t actually part of the “CPU” but a module attached to it, and that has a number of registers that control each of the interrupts going into it. Unless you read back those registers to confirm that a “wild write” hasn’t changed them, you don’t know that they haven’t changed.

shinya · January 17, 2025, 1:11am

Thank you for your reply, richard-damon.

“Wild write” means “unintentional write”, doesn’t it?
I’ve got it. As you suggested, I will check the registers that control interrupts when the phenomenon occurs.

richard-damon · January 17, 2025, 3:00am

A “wild write” is a write through a pointer that points to the wrong address, or to an array with a bad index. It’s harder to have a write that you didn’t intend to occur, but not hard to end up writing where you didn’t intend if pointers/indexes get messed up.
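A small illustration of the bad-index case, under a made-up memory layout: bytes 0 to 7 of a region are a buffer and byte 8 stands in for some unrelated setting. All names here are hypothetical and the region is an ordinary array, so the stray store is well defined in this sketch, unlike a real out-of-bounds write.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_LEN  16u  /* whole region, buffer plus whatever follows it */
#define BUF_LEN      8u  /* bytes 0..7 are the buffer */
#define SETTING_IDX  8u  /* byte 8 stands in for an unrelated setting */

/* Unchecked store: a corrupted index lands on whatever lies past the buffer. */
void buf_store_unchecked(uint8_t *region, size_t index, uint8_t value)
{
    region[index] = value;
}

/* Checked store: refuses any index outside the buffer and reports it. */
bool buf_store_checked(uint8_t *region, size_t index, uint8_t value)
{
    if (index >= BUF_LEN) {
        return false;
    }
    region[index] = value;
    return true;
}
```

Calling `buf_store_unchecked(region, 8u, 0u)` silently clears the setting at `SETTING_IDX`, while `buf_store_checked` with the same index returns false and leaves it alone.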

shinya · January 17, 2025, 4:34am

Thank you for your reply, richard-damon.
I’ve got it.

shinya · January 20, 2025, 3:12am

Referring to the AMD Xilinx Zynq UltraScale+ MPSoC TRM, Register Map, and ARM’s interrupt controller Manual, I added debug code to check whether the settings of the following registers, which seem to be related to the phenomenon, were different before and after the phenomenon occurred, and then performed a continuous operation test.

Interrupt Set-Enable Registers
Interrupt Priority Registers
Interrupt Priority Mask Registers
Interrupt Configuration Registers
Interrupt Set-Pending Registers

As a result, although the phenomenon itself occurred, there was no change in the register settings before and after the phenomenon.
Since I changed the source of the tick interrupt back to TTC (Triple Timer Counter), I also checked the TTC registers, but found no particular problems.
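A before/after comparison of this kind might look roughly like the sketch below. The offsets are the distributor register offsets from the ARM GIC architecture; everything else (`gic_irq_snapshot_t`, `gic_snapshot`, `gic_snapshot_diff`, the `DIFF_*` flags) is hypothetical, and the distributor base address and the interrupt ID have to be taken from the TRM. The priority mask register belongs to the CPU interface rather than the distributor, so it would be read separately.

```c
#include <stdint.h>

/* Distributor register offsets (ARM GIC architecture). */
#define GICD_ISENABLER   0x100u  /* 1 bit per interrupt,  32 per word */
#define GICD_ISPENDR     0x200u  /* 1 bit per interrupt,  32 per word */
#define GICD_IPRIORITYR  0x400u  /* 8 bits per interrupt,  4 per word */
#define GICD_ICFGR       0xC00u  /* 2 bits per interrupt, 16 per word */

/* Hypothetical flags naming which register word differs. */
enum {
    DIFF_ENABLE   = 1u << 0,
    DIFF_PENDING  = 1u << 1,
    DIFF_PRIORITY = 1u << 2,
    DIFF_CONFIG   = 1u << 3
};

/* Hypothetical snapshot of the register words that cover one interrupt ID. */
typedef struct {
    uint32_t isenabler;
    uint32_t ispendr;
    uint32_t ipriorityr;
    uint32_t icfgr;
} gic_irq_snapshot_t;

/* Reads the words covering irq_id; dist_base points at the distributor. */
void gic_snapshot(const volatile uint32_t *dist_base, uint32_t irq_id,
                  gic_irq_snapshot_t *out)
{
    out->isenabler  = dist_base[GICD_ISENABLER  / 4u + irq_id / 32u];
    out->ispendr    = dist_base[GICD_ISPENDR    / 4u + irq_id / 32u];
    out->ipriorityr = dist_base[GICD_IPRIORITYR / 4u + irq_id / 4u];
    out->icfgr      = dist_base[GICD_ICFGR      / 4u + irq_id / 16u];
}

/* Returns a bitmask of DIFF_* flags; 0 means the two snapshots are identical. */
unsigned gic_snapshot_diff(const gic_irq_snapshot_t *a, const gic_irq_snapshot_t *b)
{
    unsigned diff = 0u;

    if (a->isenabler  != b->isenabler)  { diff |= DIFF_ENABLE; }
    if (a->ispendr    != b->ispendr)    { diff |= DIFF_PENDING; }
    if (a->ipriorityr != b->ipriorityr) { diff |= DIFF_PRIORITY; }
    if (a->icfgr      != b->icfgr)      { diff |= DIFF_CONFIG; }
    return diff;
}
```

Each word also covers neighbouring interrupt IDs, and pending bits change during normal operation, so a `DIFF_PENDING` result on its own does not necessarily point to corruption.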

If there is anything else I should check, please let me know.

aggarg · January 20, 2025, 7:56am

That is strange as the ISR is not executing even when the interrupt is not masked. As it seems specific to the hardware, can you also reach out to Xilinx and ask them?

shinya · January 21, 2025, 7:11am

Thank you for your reply, aggarg.
I checked the AMD Xilinx support page to see if a similar problem had been reported, but I couldn’t find anything. I’m going to ask about this issue on the support page myself.

shinya · January 29, 2025, 2:28am

@aggarg @richard-damon @kstribrn
I found that there was a problem with the interrupt controller settings when booting FreeRTOS.
Because the two RPUs were started at roughly the same time, their accesses to the interrupt controller conflicted, resulting in an unintended setting.
By referring to this, I reviewed the interrupt controller settings themselves and added exclusive control to prevent access competition, and the problem no longer occurred.
Thank you for your advice. It was very helpful.
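The post does not say which accesses collided or how the exclusive control was implemented. As one possible illustration only, the sketch below assumes the conflict is a read-modify-write on a shared register: if both cores read the old value before either writes, the second write discards the first core's change. `shared_reg`, `gic_lock` and the helper functions are hypothetical stand-ins, and the lock is a C11 `atomic_flag` spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for one interrupt controller register shared by both cores. */
static volatile uint32_t shared_reg;

/* Stand-in for a lock that both cores can see (placement is an assumption). */
static atomic_flag gic_lock = ATOMIC_FLAG_INIT;

/* Unprotected read-modify-write, split in two so an interleaving can be shown. */
uint32_t rmw_read(void)
{
    return shared_reg;
}

void rmw_write(uint32_t seen, uint32_t bits)
{
    shared_reg = seen | bits;  /* based on a value that may already be stale */
}

/* Protected version: the whole read-modify-write happens while holding the lock. */
void set_bits_locked(uint32_t bits)
{
    while (atomic_flag_test_and_set_explicit(&gic_lock, memory_order_acquire)) {
        /* spin until the other core releases the lock */
    }
    shared_reg |= bits;
    atomic_flag_clear_explicit(&gic_lock, memory_order_release);
}
```

If core A and core B both call `rmw_read()` and then write bits 0x1 and 0x2 in turn, the register ends up as 0x2 and core A's bit is lost; two calls to `set_bits_locked` leave 0x3. On real hardware the lock would have to live in memory that both RPUs share and that supports the atomic operation used, which depends on the platform. Letting only one core configure the shared registers would be another way to avoid the conflict.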

aggarg · January 29, 2025, 2:54am

Thank you for sharing your solution!