Stuck in xTaskResumeAll() - Interrupt Priorities? Stack?

Hi everyone,

I just signed up for this forum because I’m facing a problem on a system we have in development. We use OpenRTOS on several other platforms and are now bringing up a new board that runs FreeRTOS (for the moment) on a Xilinx Zynq UltraScale+ (XCZU2CG). We’re using only one R5 core at the moment.

The system works fine on the bench, but when we put the board into one of our instruments there is a high chance (about 50%) that it hangs during boot. When it hangs, it is stuck in the while loop in xTaskResumeAll(), which looks like this:

while( listLIST_IS_EMPTY( &xPendingReadyList ) == pdFALSE )

I read (here in the forum and elsewhere) that this could be related to the interrupt priority configuration or to stack issues. Interrupts were actually one of my first guesses, because we see considerably more interrupts in the real instrument than on the bench, where much of the other hardware is missing.

I increased the stack settings of all tasks and the BSP by a large amount, without success. I also checked the stacks while the instrument was running (we have a function that prints all stack high/max values to the console), and all of them have enough stack left. So I assume this is not stack related, even though there was a moment during debugging when I thought it could be the stack:

I wanted to debug xTaskResumeAll() and found that my debugger showed pxTCB pointing to 0x14 (or other very low addresses in the vector area), so all values in the TCB were nonsense, of course. That’s why I first thought of a stack issue, but strangely the pointer pointed into the same area when the system booted okay and ran for hours. So I suspected an optimization or debugger-related artifact.

By chance I read in a comment that listREMOVE_ITEM() is just an optimized version of uxListRemove(). Since removing the entry from the list seemed to be the issue, I wanted to see what was going on in listREMOVE_ITEM() and where the pointer (pxItemToRemove) points. Because it is a macro, this wasn’t possible, so I decided to replace these calls with uxListRemove() and debug that function instead, which, I guess, does exactly the same thing (apart from the return value).
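To make the list operation discussed here concrete, the following is a minimal sketch of removing an item from a circular doubly linked list and of a drain loop shaped like the one in xTaskResumeAll(). It is only loosely modeled on uxListRemove(); the names `Item`, `List`, `list_remove` and `drain` are hypothetical simplifications, not the kernel's definitions. The end marker is deliberately stored as a full `Item`, so only one struct type is ever accessed through an `Item` pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's list types. */
struct List;

typedef struct Item {
    struct Item *next;
    struct Item *prev;
    struct List *container;  /* list currently holding the item, or NULL */
} Item;

typedef struct List {
    size_t count;            /* number of real items, end marker excluded */
    Item   end;              /* end marker, stored as a full Item */
} List;

void list_init(List *l)
{
    l->count = 0;
    l->end.next = &l->end;   /* an empty list points back at its own marker */
    l->end.prev = &l->end;
    l->end.container = l;
}

void list_insert_end(List *l, Item *it)
{
    it->next = &l->end;
    it->prev = l->end.prev;
    l->end.prev->next = it;
    l->end.prev = it;
    it->container = l;
    l->count++;
}

/* Unlink the item from its list and return the number of items left. */
size_t list_remove(Item *it)
{
    List *l = it->container;
    it->next->prev = it->prev;
    it->prev->next = it->next;
    it->container = NULL;
    l->count--;
    return l->count;
}

int list_is_empty(const List *l)
{
    return l->count == 0;
}

/* Same shape as the loop in question: it only terminates if every
 * removal really unlinks the head item and decrements the count. */
size_t drain(List *l)
{
    size_t removed = 0;
    while (!list_is_empty(l)) {
        list_remove(l->end.next);
        removed++;
    }
    return removed;
}
```

In this sketch the loop can only finish if each removal both unlinks the item and updates the count; if either write did not take effect as written, the loop condition would never change.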

Well, what can I say, after these changes the problem vanished. I started the instrument around 50 times without a single hang. Incidentally, pxTCB now shows a valid pointer. During these tests I went back to the original code twice, and each time the first boot after that failed.

My concern is that this whole thing is timing related, because these code changes alter the timing of the CPU relative to the external peripherals (e.g. external IRQs). Apart from that I have two questions, which is why I’m writing this post:

  • Any idea why replacing listREMOVE_ITEM() with uxListRemove() should fix this issue, apart from the timing change?
  • What does the correct interrupt priority configuration look like?

Regarding the second question: all our interrupts are initialized to priority 0xA0, which is the default value. We didn’t change any of them (I wrote a debugger script that reads all priority registers to double-check this).

FreeRTOSConfig.h includes the following related config values:

#define configUNIQUE_INTERRUPT_PRIORITIES 32
#define configMAX_API_CALL_INTERRUPT_PRIORITY (18)

The priority shift is set to 3 (which matches the hardware). I can see that in critical sections the mask is set to 0x90, which matches 18<<3. But I’m not sure whether I have to take some other setting into account, or how the CPU interrupt priorities should be configured. Somewhere I read that the tick timer interrupt should have the lowest priority; I gave that a try, but it made no difference.
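The arithmetic behind these values can be checked with a short sketch. It uses only the numbers quoted in the post (32 unique priorities, a shift of 3, a maximum API call priority of 18, and interrupts at 0xA0). The helper names are hypothetical, and the masking rule is stated as an assumption: on an ARM GIC, an interrupt is normally signalled only if its priority value is numerically lower than the priority mask.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the post. */
#define UNIQUE_PRIORITIES      32u  /* configUNIQUE_INTERRUPT_PRIORITIES */
#define MAX_API_CALL_PRIORITY  18u  /* configMAX_API_CALL_INTERRUPT_PRIORITY */
#define PRIORITY_SHIFT         3u   /* 32 levels held in an 8-bit field */

/* Convert a logical priority (0..31) to the value seen in the register. */
uint8_t priority_to_register(uint32_t logical)
{
    return (uint8_t)(logical << PRIORITY_SHIFT);
}

/* Convert a register value back to its logical priority. */
uint32_t register_to_priority(uint8_t reg)
{
    return (uint32_t)reg >> PRIORITY_SHIFT;
}

/* Assumption: an interrupt is held off whenever its priority value is
 * numerically greater than or equal to the current mask value. */
int is_masked(uint8_t irq_reg, uint8_t mask_reg)
{
    return irq_reg >= mask_reg;
}
```

Under these assumptions, 18 shifted left by 3 gives the 0x90 mask seen in critical sections, 0xA0 corresponds to logical priority 20, and an interrupt at 0xA0 is held off while the mask is 0x90. The numerically largest register value, for logical priority 31, would be 0xF8.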

I know this post has become quite long, so, thanks for reading!

Some more information:

FreeRTOS version is 202107.00 (from Xilinx Vitis); I believe this is 10.4.4
Compiler used is armr5-none-eabi-g++ 10.2 (Xilinx Vitis 2021.2)

Quite funny: I have been debugging this issue for about 3 days now and have read a lot as well.

Right after posting my question I ran into this:

I tried the -fno-strict-aliasing option, and in ~50 reboots the problem didn’t show up anymore. So I believe this helps, though I’m still not sure whether it is the right thing to change.

Using the “-fno-strict-aliasing” option is the right change, as indicated by the comment below from Richard in the other thread.

"We are currently testing -Os using arm-none-eabi-gcc version “(GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release)” and the tests all pass. If it is an aliasing problem, as would be indicated by the suggestion to use -fno-strict-aliasing, then you could also try setting configUSE_MINI_LIST_ITEM to 0 - although I think that option is only available in the git mainline at the moment. We introduced configUSE_MINI_LIST_ITEM in response to other forum posts about aliasing compiler warnings."
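As a sketch of the alternative mentioned in the quote, the setting would go into FreeRTOSConfig.h roughly as shown below. This assumes a kernel version that already supports configUSE_MINI_LIST_ITEM; per the quote, that was only the git mainline at the time, so on 10.4.4 the define would most likely have no effect and the compiler flag remains the practical option.

```c
/* FreeRTOSConfig.h (fragment, assuming the kernel supports this option) */

/* Use the full list item type for list end markers instead of the
 * smaller mini list item, as suggested in the quoted reply. */
#define configUSE_MINI_LIST_ITEM    0

/* Alternative on kernels without this option: add the GCC flag
 *     -fno-strict-aliasing
 * to the compiler options of the kernel sources. */
```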