I’m in the process of upgrading several STM32F4 and STM32H7 based projects from FreeRTOS 8.2.3 to 10.4.3 and I’ve been successful in all but one.
With one particular exception, a hard fault is occurring in prvCheckTasksWaitingTermination() - consider the following:
while( uxDeletedTasksWaitingCleanUp > ( UBaseType_t ) 0U )
pxTCB = listGET_OWNER_OF_HEAD_ENTRY( ( &xTasksWaitingTermination ) ); /*lint !e9079 void * is used as this macro is used with timers and co-routines too. Alignment is known to be fine as the type of the pointer stored and retrieved is the same. */
( void ) uxListRemove( &( pxTCB->xStateListItem ) );
prvDeleteTCB( pxTCB );
In one particular instance, the listGET_OWNER_OF_HEAD_ENTRY() macro returns NULL to pxTCB, and the act of removing the list item causes a hard fault due to accesses to out-of-bounds memory.
I’ve managed to track down the commit that exposed this fault to a change made four years ago that removed an empty list check in tasks.c (commit ID b5d8be22093fe62776e12e9c8e4cec72aa67132d - new users are allowed to add hyperlinks )
This was a change that was introduced between 9.0.1 and 10.0.0.
If anyone can offer any insights to potential causes of this problem that would be greatly appreciated.
I’m struggling to find that commit so I have raised your privilege level hopefully to a level where you can paste a link.
Have you defined
configASSERT? Also, please enable stack overflow checking and malloc failed checking: FreeRTOS - Open Source RTOS Kernel for small embedded systems.
configASSERT is defined and I also have malloc and stack overflow checking turned on.
Curious. As you say, the difference would appear to be only that the check to ensure the list is not empty has been removed, but on the face of it the check appears to be obsolete.
The only other difference is the addition of portTASK_CALLS_SECURE_FUNCTIONS(), but that should be an empty macro unless you are using an ARMv8-M, which you are not.
Can you try using the prior version (the version before that commit), then add an assert into prvCheckTasksWaitingTermination() to try and catch a situation where uxDeletedTasksWaitingCleanUp is greater than 0 but xListIsEmpty is true. You could do that by adding the assert above the call to mtCOVERAGE_TEST_MARKER() in the else clause here https://github.com/FreeRTOS/FreeRTOS-Kernel/commit/b5d8be22093fe62776e12e9c8e4cec72aa67132d#diff-fe5e906f0f477aeb079fc122120e4e2cead1ca33f79c2e460b04b1b380b929baL3535
Alternatively if using a debugger, add a NOP instruction in the same place and then place a breakpoint on the NOP instruction to see if the breakpoint ever gets hit.
Something I failed to mention is that the idle task is serviced while a higher priority thread is being delayed (using vTaskDelay(50)) - when the task yields it switches to the idle thread and into the prvCheckTasksWaitingTermination() function.
Reverted back to the commit immediately before that change (533b5338203ce4695b031729753ec081621da2d1) and added a __NOP() as instructed.
The breakpoint was hit at the exact stage of execution where the hard fault occurred with newer versions of that function.
The offending field that would cause the hard fault later on (xTasksWaitingTermination->xListEnd.pxNext->pvOwner) is NULL.
Number of items in the
xTasksWaitingTermination list is zero but
uxDeletedTasksWaitingCleanup is not. Seems like some memory corruption - can you set a data breakpoint on
uxDeletedTasksWaitingCleanup and see the call-stack when it is getting written?
I’ve been able to do a deeper dive on this.
For context, the firmware that is being run executes unit tests that would spin up and tear down threads as part of each test.
When a thread is deleted within the thread task itself (using vTaskDelete(NULL)) the TCB of the thread is moved from one of the other task lists into termination task list (xTasksWaitingTermination) and the counter for tasks awaiting cleanup (uxDeletedTasksWaitingCleanUp) increments by one.
Ordinarily this would end up being cleaned up within the idle task list where the list size and cleanup counter would be consistent. But before it gets there, vTaskDelete(…) is called from another thread with that task handle, essentially doing it a second time.
This would call uxListRemove(…) on that TCB, inadvertently removing it from the termination task list without updating uxDeletedTasksWaitingCleanUp. Those two variables that should be in sync now start drifting apart.
By the time the idle task calls prvCheckTasksWaitingTermination(), uxDeletedTasksWaitingCleanUp is 8, but the termination list is empty.
After removing the second call to vTaskDelete(…), the breakpoint for that __NOP() is not hit.
Then moved FreeRTOS to 10.4.3 - it is now working
It does raise an interesting question in prvCheckTasksWaitingTermination() - what is the purpose of having a separate counter for uxDeletedTasksWaitingCleanUp when there is already a counter within the termination list?
Is there still a bug with the way terminated tasks are accounted for within the RTOS? There is a risk that inadvertently calling vTaskDelete(…) twice within a thread and out of the thread causes these counters to drift apart.
is a bug in my opinion. Like double
I would agree with you on this - my point being that there is nothing in the implementation to guard against it.
The fact that vTaskDelete(…) could be called outside a thread (with a task handle) and inside a thread (with a NULL) may be an invitation to others to do it twice.
Maybe a flag can be added in the TCB to indicate that it is queued for termination so that a second call to vTaskDelete(…) could be ignored.
It is a fundamental problem to delete a task twice, and while you can catch this one specific case with this sort of test, there are many others that you can not, especially if the second delete happens after memory has been reclaimed, and possibly reused.
This is exactly the same problem as a double free.
Fundamentally, only the exclusive owner of the task handle should delete it, because after the delete call, no other user task should make any reference to this handle or face undefined behavior. If someone else has possible access the the handle, that needs to be some how removed before you delete the task.
Thanks for the great debugging effort and information. I’m not at my computer just now so can’t look at the source code to get a clear understanding of the report. Are we saying the kernel code deletes a task twice, or that the application code (test code I think in this case) was deleting the same task twice?
To answer your question: there is a separate variable that counts the number of tasks waiting termination when the list already contains a count to prevent the idle task from continuously entering and exiting a critical section in the idle task loop. The idle task can check the variable without a critical section but can’t check the list without one.
This comment explains it: FreeRTOS-Kernel/tasks.c at main · FreeRTOS/FreeRTOS-Kernel · GitHub
Reading this again I see this line:
answers my question - so it is the application that is deleting the same task twice rather than the kernel doing it. If I’m correct then that means there isn’t a bug in the kernel - although it would be helpful to add an assert() into the code that performs the check that was previously removed - namely check the list is no empty before trying to remove something from the list.
I agree that one less critical section block is helpful, it may also be unnecessary since the only modifications made to xTasksWaitingTermination outside of prvCheckTasksWaitingTermination() is when a task commits suicide.
I think it would be perfectly safe to replace:
while( uxDeletedTasksWaitingCleanUp > ( UBaseType_t ) 0U )
while( listCURRENT_LIST_LENGTH( &xTasksWaitingTermination ) > ( UBaseType_t ) 0U )
… and to make uxDeletedTasksWaitingCleanUp redundant.
Thank you for giving this some attention.
I am in agreement that double-deletes are not good in any situation - just surprised that it manifested itself after an RTOS update.