Still hitting null pxCurrentTCB when handling interrupts

milhead · November 26, 2024, 4:52pm

Hello Forum Folks!

I am using FreeRTOS on an STM32H573, no TrustZone, non-secure.

I am still seeing frequent NULL or invalid pxCurrentTCB pointers in FreeRTOS when calling ISR-safe functions. Most of the literature and forum posts point to the ISR priority setup as the culprit, but I believe that I am configured correctly.

[ 3] Heartbeat::heartbeatTask: __NVIC_PRIO_BITS: 4
[ 3] , configPRIO_BITS: 4
[ 3] , configLIBRARY_LOWEST_INTERRUPT_PRIORITY: 15
[ 3] , configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY: 5
[ 3] , configKERNEL_INTERRUPT_PRIORITY: 240
[ 3] , configMAX_SYSCALL_INTERRUPT_PRIORITY: 80

In the all-too-frequent stack trace below, I hit an assert that I added to FreeRTOS tasks.c (where I validate that pxCurrentTCB is in valid RAM on every dereference).

I am using the following ST HAL call to set the interrupt priority:
/* FDCAN1 IT0 interrupt Init */
HAL_NVIC_SetPriority(FDCAN1_IT0_IRQn, 5, 0);
I am a bit confused, though: I have also tried
configMAX_SYSCALL_INTERRUPT_PRIORITY+5 as the parameter, but it gets the same result.

I’ve used FreeRTOS for years on several processors without ever being haunted by a system failure this persistent, and I am sure it’s in my setup somewhere.

xuelix · November 27, 2024, 1:06am

Have you turned on (defined) configASSERT()? That might be able to capture issues regarding interrupt priority configuration and what might cause the memory corruption.

aggarg · November 27, 2024, 5:45am

Can you check the implementation of HAL_NVIC_SetPriority to ensure that it shifts the value correctly to account for __NVIC_PRIO_BITS? Also, as @xuelix suggested, please define configASSERT.

milhead · November 27, 2024, 8:09pm

I do have config assert mapped to my assert implementation. The FreeRTOS distro does not check all dereferences of pxCurrentTCB so it’s hitting an assert I have added that verifies that a pointer points to a valid RAM address.

aggarg · November 28, 2024, 6:15am

It could be a case of memory overrun then. Can you declare a variable right next to pxCurrentTCB and check its value when your assert fires:

portDONT_DISCARD PRIVILEGED_DATA TCB_t * volatile pxUnused = NULL;

If you see that pxUnused has been modified, then you can use a data breakpoint to catch when the corruption happens.

milhead · November 29, 2024, 7:11pm

Hello Aggarg,

__NVIC_PRIO_BITS is set to 4. Using all of the defines in my first post, the bottom of the ST HAL uses the code:

NVIC->IPR[((uint32_t)IRQn)] = (uint8_t)((priority << (8U - __NVIC_PRIO_BITS)) & (uint32_t)0xFFUL);

After the line executes I wind up with

NVIC_IPR_13 holds 0x5000000

which sets the priority of interrupt 55 to 0x50. Since __NVIC_PRIO_BITS is 4, the actual value of the byte register is 80, which I think is an OK priority for executing FreeRTOS ISR-ready methods.
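
As a quick check of the arithmetic above, here is a minimal C sketch that applies the same shift-and-mask expression to the values from the first post. The function names are made up for illustration and are not part of the ST HAL or FreeRTOS. The comparison assumes the usual Cortex-M rule that an ISR calling FromISR functions must have a priority byte numerically at or above configMAX_SYSCALL_INTERRUPT_PRIORITY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the configuration dump in the first post. */
#define PRIO_BITS                  4U   /* __NVIC_PRIO_BITS / configPRIO_BITS */
#define MAX_SYSCALL_PRIORITY_BYTE  80U  /* configMAX_SYSCALL_INTERRUPT_PRIORITY */

/* Same expression as the quoted NVIC->IPR write: shift the priority into
 * the implemented top bits and keep only the low byte of the result. */
uint8_t encode_priority_byte(uint32_t priority)
{
    return (uint8_t)((priority << (8U - PRIO_BITS)) & 0xFFU);
}

/* Hypothetical helper: true if an ISR at this priority sits at or below
 * the max-syscall level, i.e. its byte value is numerically >= 80. */
bool may_call_from_isr_api(uint32_t priority)
{
    return encode_priority_byte(priority) >= MAX_SYSCALL_PRIORITY_BYTE;
}

/* Worked values for the numbers discussed in the thread. */
void priority_self_check(void)
{
    assert(encode_priority_byte(5U) == 0x50U);   /* 0x50 is 80 decimal */
    assert(encode_priority_byte(15U) == 240U);   /* lowest priority byte */
    assert(encode_priority_byte(85U) == 0x50U);  /* 0x550 truncated to 0x50 */
    assert(may_call_from_isr_api(5U));
    assert(!may_call_from_isr_api(4U));          /* 0x40 is above the limit */
}
```

Assuming the argument reaches this expression unchanged, both 5 and 85 (configMAX_SYSCALL_INTERRUPT_PRIORITY+5) come out as 0x50, because the shifted value is truncated to one byte. That would be consistent with both HAL calls giving the same result.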

milhead · November 30, 2024, 2:46pm

With checks on many of the pxCurrentTCB dereferences in place, I now assert instead of hard-faulting in many instances. Below is an instance of an invalid TCB pointer.

Note: valid RAM is between 0x20000000 and 0x2009FFFF

Thread 2 hit Breakpoint 2, _assert_failed (assertion=0x8127004 "(((uint32_t)(pxCurrentTCB) >= 0x20000000) && ((uint32_t)(pxCurrentTCB) <= 0x2009FFFF))",
    file=0x8126f14 "/home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c", line=3033) at /home2/miller/src/vip-rcip/vip/libs/assert/assert.c:123
123         if (! isAssertAlreadyInProgress)
(gdb) bt
#0  _assert_failed (assertion=0x8127004 "(((uint32_t)(pxCurrentTCB) >= 0x20000000) && ((uint32_t)(pxCurrentTCB) <= 0x2009FFFF))",
    file=0x8126f14 "/home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c", line=3033) at /home2/miller/src/vip-rcip/vip/libs/assert/assert.c:123
#1  0x080a0f8c in vTaskSwitchContext () at /home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c:3033
#2  0x080a3f48 in PendSV_Handler () at /home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/portable/GCC/ARM_CM33_NTZ/non_secure/portasm.c:236
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) up
#1  0x080a0f8c in vTaskSwitchContext () at /home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c:3033
3033            configASSERT( isValidRAM(pxCurrentTCB));
(gdb) p pxCurrentTCB
$4 = (TCB_t * volatile) 0x64227b20

aggarg · December 2, 2024, 5:03am

This seems like a memory corruption as I mentioned before. Did you try my previous suggestion? Also, can you try to disable parts of your application to narrow down the problem?

milhead · December 3, 2024, 3:54pm

Well, I just hit another one and it’s odd… I had originally bracketed the pxCurrentTCB pointer with a fixed variable on each side.

portDONT_DISCARD PRIVILEGED_DATA TCB_t * volatile pxCurrentTCB_pre = (TCB_t *) 0xdeadbeef;
portDONT_DISCARD PRIVILEGED_DATA TCB_t * volatile pxCurrentTCB = NULL;
portDONT_DISCARD PRIVILEGED_DATA TCB_t * volatile pxCurrentTCB_post = (TCB_t *) 0xac987654;

After running under moderate CAN traffic for about 5 minutes I hit my bus fault.

The really odd part was that pxCurrentTCB was pointing at an arbitrary location above my statically allocated heartbeat task stack, which I report at startup as

main: TaskCreated: Heartbeat, stack address: 0x0x20006c00, TCB address: 0x0x200020d8

After the fault the bookends that I put around the pxCurrentTCB pointer are unchanged.

The pxCurrentTCB pointer is set to somewhere beyond even the end of the uxHeartbeatTaskStack as is indicated below.

Thread 2 hit Breakpoint 3, BusFault_Handler () at /home/miller/src/vip-rcip/vip/src/system/stm32h5xx_it.c:127
127  while (1)
(gdb) bt
#0  BusFault_Handler () at /home/miller/src/vip-rcip/vip/src/system/stm32h5xx_it.c:127
#1  0xffffffac in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p pxCurrentTCB
$1 = (TCB_t * volatile) 0x20007ba3 <uxHeartbeatTaskStack+4003>
(gdb) p pxCurrentTCB_pre
$2 = (TCB_t * volatile) 0xdeadbeef
(gdb) p pxCurrentTCB_post
$3 = (TCB_t * volatile) 0xac987654
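
One observation on the two bad values seen so far: 0x64227b20 is outside the stated RAM window, but 0x20007ba3 is inside it, so a range-only check would accept it. A minimal sketch of a slightly stricter plausibility check follows. The function name is hypothetical, and the 4-byte alignment requirement is an assumption (it holds for every TCB address printed in this thread, but is not taken from the FreeRTOS sources).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RAM_START       0x20000000UL  /* valid RAM window quoted in the thread */
#define RAM_END         0x2009FFFFUL
#define TCB_ALIGN_MASK  0x3UL         /* assumption: TCBs are 4-byte aligned */

/* Takes the address as an integer so the sketch also builds on a 64-bit host. */
bool is_plausible_tcb_address(uint32_t addr)
{
    bool in_ram  = (addr >= RAM_START) && (addr <= RAM_END);
    bool aligned = (addr & TCB_ALIGN_MASK) == 0U;
    return in_ram && aligned;
}

/* The three addresses below all appear in the gdb output and startup log. */
void plausibility_self_check(void)
{
    assert(!is_plausible_tcb_address(0x64227b20UL)); /* outside RAM */
    assert(!is_plausible_tcb_address(0x20007ba3UL)); /* in RAM, misaligned */
    assert(is_plausible_tcb_address(0x200020d8UL));  /* Heartbeat TCB */
}
```

This still cannot tell a real TCB from any other aligned RAM address; it only narrows the window in which a corrupt pointer goes unnoticed.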

xuelix · December 3, 2024, 8:53pm

Were you able to set a data access breakpoint on pxCurrentTCB to see what is trying to modify it?
Would be interesting to see what is in your pxReadyTasksList[your task priority] while running.

milhead · December 3, 2024, 10:42pm

I can set the watch on pxCurrentTCB, but it hits all the time (which I guess I expect). Where my arm-none-eabi-gdb fails is when I try to set a condition to break only on an invalid TCB. I only have 8 tasks, all statically allocated.

[       0] main: TaskCreated: CAN, TCB address: 0x0x20001800
[       0] main: TaskCreated: Status, TCB address: 0x0x20001c6c
[       0] main: TaskCreated: Heartbeat, TCB address: 0x0x200020d8
[       0] main: TaskCreated: Logging, TCB address: 0x0x20002544
[       0] main: TaskCreated: SFI SPI Task, TCB address: 0x0x200029b0
[       0] main: TaskCreated: SFI to VIP, TCB address: 0x0x20002e1c
[       0] main: FreeRTOS Timer Task TCB   0x0x20003288
[       0] main: FreeRTOS Idle Task TCB    0x0x200032f4

I tried to setup gdb with:

(gdb) watch pxCurrentTCB
Watchpoint 5: pxCurrentTCB
(gdb) condition 5 pxCurrentTCB != 0x20001800 && pxCurrentTCB != 0x20001c6c && pxCurrentTCB != 0x200020d8 && pxCurrentTCB != 0x20002544 && pxCurrentTCB != 0x200029b0 && pxCurrentTCB != 0x20002e1c && pxCurrentTCB != 0x20003288 && pxCurrentTCB != 0x200032f4

But it’s breaking on every update of pxCurrentTCB, which is pretty useless.

In my latest cut I’m writing an isValidTCB() macro so that I can instrument FreeRTOS to assert at each of the half dozen places where tasks.c and list.h actually update pxCurrentTCB.

It is used in tasks.c like…

                    configASSERT(isValidTCB( pxNewTCB ));
                    pxCurrentTCB = pxNewTCB;

isValidTCB() will look something like…

    #define isValidTCB(x)     (     ((uint32_t)(x) == 0x20001800)   \
                                 || ((uint32_t)(x) == 0x20001c6c)   \
                                 || ((uint32_t)(x) == 0x200020d8)   \
                                 || ((uint32_t)(x) == 0x20002544)   \
                                 || ((uint32_t)(x) == 0x200029b0)   \
                                 || ((uint32_t)(x) == 0x20002e1c)   \
                                 || ((uint32_t)(x) == 0x20003288)   \
                                 || ((uint32_t)(x) == 0x200032f4)   \
                                 )

…Stay tuned, still working on it

milhead · December 3, 2024, 10:48pm

If this works I may build something a little more elegant that does not have literal TCB pointers but for my current purposes it ‘should’ work.
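
One possible shape for the version without literal addresses is a small table that is filled in as tasks are created. This is only a sketch under assumptions: the names register_tcb and is_registered_tcb are hypothetical, the table size of 8 matches the task count given above, and the pointers are treated as opaque so the code does not depend on the TCB layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_KNOWN_TCBS 8U   /* the thread mentions 8 tasks in total */

static const void *known_tcbs[MAX_KNOWN_TCBS];
static size_t known_tcb_count;

/* Record one TCB address; returns false for NULL or when the table is full. */
bool register_tcb(const void *tcb)
{
    if ((tcb == NULL) || (known_tcb_count >= MAX_KNOWN_TCBS)) {
        return false;
    }
    known_tcbs[known_tcb_count++] = tcb;
    return true;
}

/* Linear search is adequate for a handful of tasks. */
bool is_registered_tcb(const void *tcb)
{
    for (size_t i = 0U; i < known_tcb_count; i++) {
        if (known_tcbs[i] == tcb) {
            return true;
        }
    }
    return false;
}

/* Host-side check using plain ints as stand-ins for TCBs. */
void registry_self_check(void)
{
    static int task_a, task_b, stranger;
    assert(register_tcb(&task_a));
    assert(register_tcb(&task_b));
    assert(is_registered_tcb(&task_a));
    assert(!is_registered_tcb(&stranger));
    assert(!register_tcb(NULL));
}
```

In the application, each handle returned by xTaskCreateStatic() could be passed to register_tcb(), and the assert in tasks.c would then call is_registered_tcb(pxNewTCB). The table itself lives in RAM, so it can be corrupted like anything else.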

aggarg · December 4, 2024, 9:15am

That is not necessarily wrong as pxCurrentTCB is not supposed to be in the task stack range, because the TCB is not kept on the stack.

milhead · December 4, 2024, 2:12pm

As I understand it, pxCurrentTCB should only point to valid TCB structures once the scheduler is running. Since all my TCBs are statically allocated, their addresses are pretty consistent.

The fact that the pxCurrentTCB pointer was assigned a value in one of the tasks’ stack space (also all statically allocated) was just an observation; I did not mean to conclude anything from it.

aggarg · December 4, 2024, 4:13pm

Understood, thanks. Did you try to read the memory at that address to see if it looks like a TCB, for example whether it contains a task name?

milhead · December 17, 2024, 5:55am

I believe that this was what stepped on our FreeRTOS variables… A pxCurrentTCB corruption usually turns quickly into a processor bus fault or some other memory violation, and the tools we were using on the STM32H57 could not provide a backtrace, which made it a tough debug (well, tough for me anyway)… We have open tickets with the tools vendor to repair the debugger. In the end, V13 arm-none-eabi-gdb was able to trap and backtrace the modification of a non-changing variable near the one being corrupted.

Thanks Again!

milhead · December 17, 2024, 6:06am

Hello Everyone,

I apologize for the late reply but was out of town and supporting the product we had this issue on.

I think Gaurav Aggarwal gets credit for predicting the cause, as long as this specific corruption stays down.

The final analysis is not in yet, but a sizeable portion of the .bss was being overwritten by code in the ST SPI HAL. I cannot yet say if the API was being misused by our developers or if we had encountered an untested corner case in the HAL code. We are having a contractor look into our use of the API and the HAL code to make sure we are not at risk of the same condition in other uses.

I’m not sure about the legality of posting another company’s code here, but I can describe the case we encountered: a decrementing counter was somehow counting down to zero and then past it, so instead of moving a small number of bytes out of the STM32H57 SPI peripheral to a .bss buffer, it was attempting to copy that many bytes plus 64k additional bytes.
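
To illustrate how a countdown that slips past zero can turn into roughly 64k extra bytes, here is a minimal, hypothetical sketch. It is not the ST HAL code, and the actual mechanism in the HAL is not known from this thread; it only assumes a 16-bit counter that is tested after it has been decremented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical receive loop, NOT the ST HAL code: the 16-bit counter is
 * tested only after the decrement, so entering with remaining == 0 wraps
 * it to 0xFFFF. Returns how many bytes the loop would move. */
uint32_t bytes_moved_unguarded(uint16_t remaining)
{
    uint32_t moved = 0U;
    do {
        moved++;       /* stands in for copying one byte into the buffer */
        remaining--;   /* 0 - 1 wraps to 0xFFFF in a uint16_t */
    } while (remaining != 0U);
    return moved;
}

/* Guarded variant: test the counter before touching the buffer. */
uint32_t bytes_moved_guarded(uint16_t remaining)
{
    uint32_t moved = 0U;
    while (remaining != 0U) {
        moved++;
        remaining--;
    }
    return moved;
}

void wrap_self_check(void)
{
    assert(bytes_moved_unguarded(4U) == 4U);       /* normal case */
    assert(bytes_moved_unguarded(0U) == 65536UL);  /* wrapped past zero */
    assert(bytes_moved_guarded(0U) == 0U);
    assert(bytes_moved_guarded(4U) == 4U);
}
```

In this sketch the fix is simply to check the count before the first access; the protection actually added to the HAL code is not shown in the thread.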

In one of the last actions I described above, I was attempting to surround the pxCurrentTCB pointer storage location with some fixed-value variables that I could watch for writes with gdb. It took me an embarrassing amount of time to realize that the pxCurrentTCB pointer is only initialized to NULL and is hence placed in, and zeroed with, the rest of the .bss, while I had surrounded it in the code with pointer variables that had non-zero initial values, which land in .data… I talk about .bss/.data storage selection all the time but neglected to listen to myself when wrapping pxCurrentTCB. (I earn my crow meal here… sigh)…

Once I added the decoration to force pxCurrentTCB into the .data segment, it no longer suffered corruption… Instead, some other FreeRTOS list array structure located in .bss was getting corrupted… After wrapping this reproducible corruption in bookend variables, which I was careful to keep in the corrupted .bss segment, it was a quick operation to watch the bookend variables and backtrace to the offending code (ST SPI HAL).

Since we added protection in the HAL code to work around the off-by-one error, we have not seen a hard fault caused by a corrupt FreeRTOS structure or pointer.

Thanks again for everyone’s feedback and tolerance. Forums like this, and the time that developers contribute to them, are invaluable to developers who are in over their heads in a situation that can be hard to get out of…
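
A minimal sketch of bookends that stay in .bss: they are declared without a non-zero initializer and receive their pattern at run time. The names are hypothetical, and the sketch assumes a typical GCC build in which zero-initialized statics go to .bss and the linker keeps them adjacent, which is worth confirming in the map file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUARD_PATTERN 0xDEADBEEFUL

/* No non-zero initializers here, so a typical GCC build places all three
 * in .bss. Adjacency and ordering are NOT guaranteed by the C language. */
static volatile uint32_t guard_pre;
static volatile uint32_t protected_value;  /* stand-in for the watched variable */
static volatile uint32_t guard_post;

/* Write the pattern at run time, e.g. early in main() before the scheduler. */
void guards_arm(void)
{
    guard_pre  = GUARD_PATTERN;
    guard_post = GUARD_PATTERN;
}

/* True while neither bookend has been overwritten. */
bool guards_intact(void)
{
    return (guard_pre == GUARD_PATTERN) && (guard_post == GUARD_PATTERN);
}

/* Host-side demonstration: a simulated overrun clobbers the trailing guard. */
void guards_self_check(void)
{
    guards_arm();
    protected_value = 1U;      /* normal writes leave the guards alone */
    assert(guards_intact());
    guard_post = 0U;           /* simulated stray write */
    assert(!guards_intact());
}
```

Because the guards never change after guards_arm(), a plain gdb watchpoint on either one fires only on the stray write, which matches the approach that eventually produced the backtrace.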

Miller

RAc · December 17, 2024, 10:09am

Thanks for the update!

In developing software for RTOSs, it is a very common pitfall to chase the symptoms rather than the root cause: a memory corruption typically manifests itself many, many cycles after the corruption happened. With enough experience, you will sooner or later “look through” the symptoms and know what else to look for; there are usual suspects.

If the symptoms are reproducible, hardware breakpoints are a big help in encircling the root cause (I believe that @aggarg already mentioned that).

aggarg · December 17, 2024, 10:23am

Thank you for reporting back! Good debug!