I am still seeing frequent NULL or invalid pxCurrentTCB pointers in FreeRTOS when calling ISR-approved functions. Most of the literature and forum posts point to the ISR priority setup as the culprit, but I believe that I am configured correctly.
In the all-too-frequent stack trace below I hit an assert I added to FreeRTOS tasks.c (where I validate that pxCurrentTCB is in valid RAM on every dereference).
I am using the following ST HAL call
/* FDCAN1 IT0 interrupt Init */
HAL_NVIC_SetPriority(FDCAN1_IT0_IRQn, 5, 0);
to set the interrupt priority, and I am a bit confused. I have also tried
configMAX_SYSCALL_INTERRUPT_PRIORITY+5 as the parameter, but I get the same result.
I’ve used FreeRTOS for years on several processors without ever being haunted by a system failure this persistent, and I am sure it’s in my setup somewhere.
Have you turned on (defined) configASSERT()? That might be able to capture issues regarding interrupt priority configuration and what might cause the memory corruption.
Can you check the implementation of HAL_NVIC_SetPriority to ensure that it shifts the value correctly to account for __NVIC_PRIO_BITS? Also, as @xuelix suggested, please define configASSERT.
I do have configASSERT mapped to my assert implementation. The FreeRTOS distro does not check every dereference of pxCurrentTCB, so it’s hitting an assert I added that verifies the pointer holds a valid RAM address.
This sets the priority of interrupt 55 to 0x50. Since __NVIC_PRIO_BITS is 4, the actual value of the byte register is 80 (0x50), which I think is an OK priority for executing FreeRTOS ISR-ready methods.
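To make the arithmetic above concrete, here is a minimal host-side sketch of how a logical priority is shifted into the 8-bit NVIC priority field when only the top __NVIC_PRIO_BITS bits are implemented. The helper names (`encode_priority`, `may_call_from_isr_api`) and the 0x50 used for the max-syscall priority are assumptions for illustration, not values taken from the project's FreeRTOSConfig.h or the HAL source.

```c
#include <stdbool.h>
#include <stdint.h>

/* 4 implemented priority bits, as stated in the post. */
#define PRIO_BITS 4u

/* Assumed example value in raw (already shifted) register form.
 * The real value comes from the project's FreeRTOSConfig.h. */
#define MAX_SYSCALL_PRIORITY_RAW 0x50u

/* Shift a logical priority into the top bits of the 8-bit priority field
 * and truncate to one byte, the way the register itself would. */
static uint8_t encode_priority(uint32_t logical)
{
    return (uint8_t)((logical << (8u - PRIO_BITS)) & 0xFFu);
}

/* On Cortex-M a numerically larger priority value is less urgent.
 * An ISR may use the FromISR API only if its raw priority is numerically
 * greater than or equal to the max-syscall priority. */
static bool may_call_from_isr_api(uint8_t raw)
{
    return raw >= MAX_SYSCALL_PRIORITY_RAW;
}
```

Under these assumptions, `encode_priority(5)` gives 0x50 (80 decimal), which passes the check, while a logical priority of 4 (0x40) would not. If the max-syscall macro is already in shifted form, 0x50 + 5 pushed through the same shift and mask also ends up as 0x50, which would be consistent with seeing the same result either way.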
With checks on many of the pxCurrentTCB dereferences in place, I now assert instead of hard-faulting in many instances. Below is an instance of an invalid TCB pointer.
Note: valid RAM is between 0x20000000 and 0x2009FFFF.
Thread 2 hit Breakpoint 2, _assert_failed (assertion=0x8127004 "(((uint32_t)(pxCurrentTCB) >= 0x20000000) && ((uint32_t)(pxCurrentTCB) <= 0x2009FFFF))",
file=0x8126f14 "/home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c", line=3033) at /home2/miller/src/vip-rcip/vip/libs/assert/assert.c:123
123 if (! isAssertAlreadyInProgress)
(gdb) bt
#0 _assert_failed (assertion=0x8127004 "(((uint32_t)(pxCurrentTCB) >= 0x20000000) && ((uint32_t)(pxCurrentTCB) <= 0x2009FFFF))",
file=0x8126f14 "/home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c", line=3033) at /home2/miller/src/vip-rcip/vip/libs/assert/assert.c:123
#1 0x080a0f8c in vTaskSwitchContext () at /home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c:3033
#2 0x080a3f48 in PendSV_Handler () at /home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/portable/GCC/ARM_CM33_NTZ/non_secure/portasm.c:236
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) up
#1 0x080a0f8c in vTaskSwitchContext () at /home2/miller/src/vip-rcip/vip/libs/FreeRTOSV101/Source/tasks.c:3033
3033 configASSERT( isValidRAM(pxCurrentTCB));
(gdb) p pxCurrentTCB
$4 = (TCB_t * volatile) 0x64227b20
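For reference, the range check that fired in the trace above can be sketched as a small function. The bounds are the ones quoted in the assertion string; the function name `is_valid_ram` is a stand-in, and the real isValidRAM macro may be written differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* RAM window as quoted in the assertion text (inclusive bounds). */
#define RAM_START 0x20000000u
#define RAM_END   0x2009FFFFu

/* Returns true when addr lies inside the valid RAM window. */
static bool is_valid_ram(uint32_t addr)
{
    return (addr >= RAM_START) && (addr <= RAM_END);
}
```

On the target this would be wrapped as something like `configASSERT( is_valid_ram( ( uint32_t ) pxCurrentTCB ) )`. The value 0x64227b20 printed by gdb falls well outside the window, so the check rejects it.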
This seems like a memory corruption as I mentioned before. Did you try my previous suggestion? Also, can you try to disable parts of your application to narrow down the problem?
After running under moderate CAN traffic for about 5 minutes I hit my bus fault.
The really odd part was that pxCurrentTCB pointed at an arbitrary location above my statically allocated heartbeat task stack, which I report at startup as
Were you able to set a data access breakpoint on pxCurrentTCB to see what is trying to modify it?
Would be interesting to see what is in your pxReadyTasksList[your task priority] while running.
I can set the watch on pxCurrentTCB but it hits all the time (which I guess I expect). Where my arm-none-eabi-gdb fails is when I try to set a condition to only break on an invalid TCB… I only have 8 tasks, all statically allocated.
It’s breaking on every update of pxCurrentTCB, which is pretty useless.
In my latest cut I’m writing an isValidTCB() macro so that I can instrument FreeRTOS to assert at each of the half dozen places where tasks.c and list.h actually update pxCurrentTCB.
As I understand it, pxCurrentTCB should only point to valid TCB structures once the scheduler is running. Since all my TCBs are statically allocated, their addresses are pretty consistent.
The fact that pxCurrentTCB was assigned a value in one of the task’s stack space (also all statically allocated) was just an observation; I did not mean to conclude anything from it.
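Because the TCBs are statically allocated, a stricter check than "is it in RAM" is possible: compare the pointer against the known TCB addresses. The sketch below is one way such an isValidTCB() check might look. `DemoTCB_t`, `demoTcbs` and `is_valid_tcb` are hypothetical names so that it builds on a host; on the target the table would hold the addresses of the real StaticTask_t buffers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's TCB type so the sketch builds on a host. */
typedef struct { int placeholder; } DemoTCB_t;

/* The post mentions 8 statically allocated tasks. */
#define DEMO_TASK_COUNT 8u

static DemoTCB_t demoTcbs[DEMO_TASK_COUNT];

/* Returns true only if candidate is exactly the start of a known TCB. */
static bool is_valid_tcb(const void *candidate)
{
    for (size_t i = 0; i < DEMO_TASK_COUNT; i++) {
        if (candidate == (const void *)&demoTcbs[i]) {
            return true;
        }
    }
    return false;
}
```

A check like this also rejects pointers that are inside RAM but not at a TCB, such as an address in a task stack. Any kernel-created tasks (the idle task, for example) would need to be in the table as well.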
I believe that this was what stepped on our FreeRTOS variables… The fact that a pxCurrentTCB corruption usually quickly turns into a processor bus fault or some other memory violation and the tools we were using on the STM32H57 could not provide a backtrace made it a tough debug (well tough for me anyway)… We have open tickets with the tools vendor to repair the debugger. In the end V13 arm-none-eabi-gdb was able to trap and backtrace the modification of a non-changing variable near the one being corrupted.
I apologize for the late reply but was out of town and supporting the product we had this issue on.
I think Gaurav Aggarwal gets credit for predicting the cause, as long as this specific corruption stays down.
The final analysis is not in yet, but a sizeable portion of the .bss was being overwritten by code in the ST SPI HAL. I cannot yet say whether the API was being misused by our developers or whether we had encountered an untested corner case in the HAL code. We are having a contractor look into our use of the API and the HAL code to make sure we are not at risk of the same condition in other uses.
I’m not sure about the legality of posting another company’s code here, but I can describe the case we encountered: a decrementing counter was somehow counting down to zero and then past it, so instead of moving a small number of bytes out of the STM32H57 SPI peripheral to a .bss buffer it was attempting to copy that number plus 64k additional bytes.
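The failure mode described above can be modelled on a host without any vendor code. The sketch below is purely hypothetical and is not the ST HAL implementation: it only shows how a 16-bit unsigned counter that is decremented once too often wraps from 0 to 0xFFFF, so that a loop meant to move a few bytes keeps going for roughly 64k more. The function name and the `extraDecrement` flag are invented for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counts how many byte moves a count-down loop would attempt.
 * extraDecrement models a single off-by-one decrement at zero. */
static uint32_t bytes_moved(uint16_t requested, bool extraDecrement)
{
    uint16_t remaining = requested;
    uint32_t moved = 0;
    bool glitchPending = extraDecrement;

    while (remaining != 0u) {
        moved++;          /* one byte would be written to the buffer here */
        remaining--;
        if (glitchPending && (remaining == 0u)) {
            remaining--;  /* unsigned wrap: 0 becomes 0xFFFF */
            glitchPending = false;
        }
    }
    return moved;
}
```

With `requested` set to 4, the well-behaved loop moves 4 bytes, while the variant with the extra decrement moves 4 + 65535, far past the end of any small buffer.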
In one of the last actions I mentioned above, I was attempting to surround the pxCurrentTCB pointer’s storage location with some fixed-value variables that I could watch for writes with gdb. It took me an embarrassing amount of time to realize that pxCurrentTCB has no initializer and is therefore zeroed with the rest of the .bss, while the initialized pointer variables I had placed around it in the code ended up in .data… I talk about .bss/.data storage selection all the time but neglected to listen to myself when wrapping pxCurrentTCB. (I earn my crow meal here… sigh)…
Once I added the decoration to force pxCurrentTCB into the .data segment it no longer suffered corruption… Instead, some other FreeRTOS list array structure located in .bss was getting corrupted… I wrapped this reproducible corruption in bookend variables, taking care that they stayed in the corrupted .bss segment. After that it was a quick operation to watch the bookend variables and backtrace to the offending code (ST SPI HAL).
Since we added protection in the HAL code to work around the off-by-one error, we have not seen a hard fault caused by a corrupt FreeRTOS structure or pointer.
Thanks again for everyone’s feedback and tolerance. Forums like this, and the time that developers contribute to them, are invaluable to developers who are in over their heads in a situation that can be hard to get out of…
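A minimal sketch of the bookend idea follows. It is a variation, not the exact code used: the guards and the watched variable are put in one struct (so their order is fixed) and given no static initializers (so a typical toolchain places the object in .bss). The names `bookended`, `bookends_arm` and `bookends_intact`, and the canary value, are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CANARY 0xA5A5A5A5u

/* No static initializer, so this object normally lands in .bss.
 * Struct members keep their declared order, so the guards stay
 * on either side of the watched variable. */
static struct {
    uint32_t guardBefore;
    void    *watched;      /* stand-in for the variable being corrupted */
    uint32_t guardAfter;
} bookended;

/* Write the canaries at run time, after .bss has been zeroed. */
static void bookends_arm(void)
{
    bookended.guardBefore = CANARY;
    bookended.guardAfter  = CANARY;
}

/* Returns true while neither guard has been overwritten. */
static bool bookends_intact(void)
{
    return (bookended.guardBefore == CANARY) &&
           (bookended.guardAfter  == CANARY);
}
```

After `bookends_arm()` nothing should ever write to the guards again, so a gdb write watchpoint on either one, or a periodic `bookends_intact()` assert, fires only on the stray write.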
In developing software for RTOSs, it is a very common pitfall to chase after the symptoms rather than the root cause: a memory corruption typically manifests itself many, many cycles after the corruption happened. With enough experience, you will sooner or later “look through” the symptoms and know what else to look for; there are usual suspects.
If the symptoms are reproducible, hardware breakpoints are a big help in encircling the root cause (I believe that @aggarg already mentioned that).