Unexpected sporadic errors when trying to release previously acquired recursive mutexes

I experience unexpected behavior on Win32 POrt when trying to use recursive mutexes. The program works during a few to a couple of tens minutes, and then one of xSemaphoreGiveRecursive() funtion returns false. This occurs in several places of the program code, it seems, that almost every mutex in this program is susceptible to this (sporadic) error. The problem is that on all of these places, before xSemaphoreGiveRecursive(), there is always a corresponding call to xSemaphoreTakeRecursive(), and between them usually (but not always) there are short code with few lines and with no FreeRTOS API calls. xSemaphoreGiveRecursive() returns FALSE only if mutex is not currently being hold by the calling process. It is strange for me, and resembles either memory corruption, or some bug in FreeRTOS Win32 port, (or misuse of some functions of it). Can anyone suggest any ideas on how to debug this?

When running the same source code as firmware in an MCU, with Cortex-M FreeRTOS kernel port, I do not recall to observe such errors.

In addition, I did some work to log out some system state after xSemaphoreGiveRecursive() retruns FALSE when it SHOULD return TRUE. I included “tasks.c” into another source file, via #include instruction, and now print, among LINE and FILE of the erroneous line, also pxCurrentTCB.pcTaskName, just after the mentioned error occurs. Sometimes pxCurrentTCB.pcTaskName is wrong, and shows another thread which do not have any relation to working with a given Mutex, sometimes some stub thread which contains essentially only vTaskDelay(10) inside, but with higher priority than majority of other threads. The problem seems to be here, apparently, pxCurrentTCB got messed for some reasons. What can be done in this situtation?

Almost certainly a memory access error (write past a bounded array or something the like) that happens to manfest itself in corruption of system objects. Reduce your application to bare minimum, then build it up piece by piece until you hit the problem.

I reconsidered prvProcessSimulatedInterrupts() function code, and it seems that this problem caused not by memory corruption, but by the BUG in FreeRTOS Win32/64 Port code which leads to race conditions, around vTaskSwitchContext() call inside prvProcessSimulatedInterrupts() scheduler loop. The problem is that inside vTaskSwitchContext()->taskSELECT_HIGHEST_PRIORITY_TASK()
->listGET_OWNER_OF_NEXT_ENTRY() there ia a line:

( pxTCB ) = ( pxConstList )->pxIndex->pvOwner;
there pxTCB == pxCurrentTCB

BUT, the thread which was assigned to the previously selected pxCurrentTCB, is NOT suspended during execution of vTaskSwitchContext(). If inside listGET_OWNER_OF_NEXT_ENTRY() a new thread will be selected for execution, then the previously executed will be suspended only a few lines below:



			vTaskSwitchContext();

			/* If the task selected to enter the running state is not the task
			that is already in the running state. */
			if( pvOldCurrentTCB != pxCurrentTCB )
			{
				/* Suspend the old thread.  In the cases where the (simulated)
				interrupt is asynchronous (tick event swapping a task out rather
				than a task blocking or yielding) it doesn't matter if the
				'suspend' operation doesn't take effect immediately - if it
				doesn't it would just be like the interrupt occurring slightly
				later.  In cases where the yield was caused by a task blocking
				or yielding then the task will block on a yield event after the
				yield operation in case the 'suspend' operation doesn't take
				effect immediately.  */
				pxThreadState = ( ThreadState_t *) *( ( size_t * ) pvOldCurrentTCB );
                -->>> SuspendThread( pxThreadState->pvThread );

Since there is a time gap between re-assigning pxCurrentTCB pointer and the actual suspension of previously active thread, if that thread checks for pxCurrentTCB variable in this same moment, the returned value will be incorrect.

I added the following quick fix to prvProcessSimulatedInterrupts() function:

            if (pvOldCurrentTCB)
            {
                pxThreadState = ( ThreadState_t *) *( ( size_t * ) pvOldCurrentTCB ); // <-- 1
                if (pxThreadState && pxThreadState->pvThread)
                    SuspendThread( pxThreadState->pvThread ); // <-- 2
            }

			/* Select the next task to run. */
			vTaskSwitchContext();

			/* If the task selected to enter the running state is not the task
			that is already in the running state. */
			if( pvOldCurrentTCB != pxCurrentTCB )
			{
				/* Suspend the old thread.  In the cases where the (simulated)
				interrupt is asynchronous (tick event swapping a task out rather
				than a task blocking or yielding) it doesn't matter if the
				'suspend' operation doesn't take effect immediately - if it
				doesn't it would just be like the interrupt occurring slightly
				later.  In cases where the yield was caused by a task blocking
				or yielding then the task will block on a yield event after the
				yield operation in case the 'suspend' operation doesn't take
				effect immediately.  */
				pxThreadState = ( ThreadState_t *) *( ( size_t * ) pvOldCurrentTCB );
                // SuspendThread( pxThreadState->pvThread );

				/* Ensure the thread is actually suspended by performing a
				synchronous operation that can only complete when the thread is
				actually suspended.  The below code asks for dummy register
				data.  Experimentation shows that these two lines don't appear
				to do anything now, but according to
				https://devblogs.microsoft.com/oldnewthing/20150205-00/?p=44743
				they do - so as they do not harm (slight run-time hit). */
				xContext.ContextFlags = CONTEXT_INTEGER;
				( void ) GetThreadContext( pxThreadState->pvThread, &xContext );

				/* Obtain the state of the task now selected to enter the
				Running state. */
				pxThreadState = ( ThreadState_t * ) ( *( size_t *) pxCurrentTCB );

				/* pxThreadState->pvThread can be NULL if the task deleted
				itself - but a deleted task should never be resumed here. */
				configASSERT( pxThreadState->pvThread != NULL );
				ResumeThread( pxThreadState->pvThread );
            }
            else
            {
                if (pvOldCurrentTCB && pxThreadState && pxThreadState->pvThread)
                    ResumeThread( pxThreadState->pvThread ); // <-- 2
            }

Some checks for non-NULL pointers may be redundant, but the idea should be clear. It may be even more clean not to use Resume/Suspend API in case if pxCurrentTCB was not changed. Maybe, by making vTaskSwitchContext() not to update pxCurrentTCB variable inside, but to return pointer to newly selected thread to run.

After this change, strange sporadic errors stated in the topic, seems to disappear (at least for now the program is running for ~30 minutes without them, while previous best result, before these edits, was about ~10 minutes).

ok, apologies, it slipped my attention that this is not native but simulated code. Thanks for sharing your analysis!

@Vitaly_S Thank you for sharing your analysis. Would you please try with the following snippet and see that it addresses the issue you mentioned:

            if( ulSwitchRequired != pdFALSE )
            {
                /* Suspend the old thread. */
                pxThreadState = ( ThreadState_t * ) *( ( size_t * ) pxCurrentTCB );
                SuspendThread( pxThreadState->pvThread );

                /* Ensure the thread is actually suspended by performing a
                 * synchronous operation that can only complete when the thread
                 * is actually suspended. The below code asks for dummy register
                 * data. Experimentation shows that these two lines don't appear
                 * to do anything now, but according to
                 * https://devblogs.microsoft.com/oldnewthing/20150205-00/?p=44743
                 * they do - so as they do not harm (slight run-time hit). */
                xContext.ContextFlags = CONTEXT_INTEGER;
                ( void ) GetThreadContext( pxThreadState->pvThread, &xContext );

                /* Select the next task to run. */
                vTaskSwitchContext();

                /* Obtain the state of the task now selected to enter the
                 * Running state. */
                pxThreadState = ( ThreadState_t * ) ( *( size_t * ) pxCurrentTCB );

                /* pxThreadState->pvThread can be NULL if the task deleted
                 * itself - but a deleted task should never be resumed here. */
                configASSERT( pxThreadState->pvThread != NULL );
                ResumeThread( pxThreadState->pvThread );
            }

I compiled the same program on the same platform with proposed corrections. Yes, now it works, at least, for now it is running for almost 1,5 hours without errors (vs max. 10 minutes with the unfixed version of prvProcessSimulatedInterrupts()).

Thank you for sharing results!

This PR addresses the issue - Fix race in prvProcessSimulatedInterrupts by aggarg · Pull Request #1055 · FreeRTOS/FreeRTOS-Kernel · GitHub.