Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

n1vg wrote on Tuesday, October 09, 2018:

I’m running FreeRTOS 10.0.1 on a Kinetis MK22FN1M0AVLH12 (Cortex M4F core), specifically the MCUonEclipse port, with MCUXpresso and gcc.

I’ve been chasing an intermittent crash for weeks and I’m finally narrowing in on it. I have a temporary fix that seems to be working but I don’t see why it’d be necessary. It’s been hard to track down until now because it’d usually take hours to crash.

What’s happening is that vTaskSwitchContext is sometimes being interrupted by a flex timer peripheral IRQ which uses xTaskNotifyFromISR. Normally this works fine, but if the interrupt happens at the wrong place in listGET_OWNER_OF_NEXT_ENTRY it ends up setting pxCurrentTCB to null and the system crashes - with the stack clobbered, which has been slowing down my debugging. The SysTick and PendSV priorities are set to 15 (minimum), and the maximum API call priority is 5. The timer IRQ is set to priority 7 at the moment.

To make the problem more apparent and speed up testing, I changed listGET_OWNER_OF_NEXT_ENTRY to include a delay (line continuation backslashes omitted for clarity):

#define listGET_OWNER_OF_NEXT_ENTRY( pxTCB, pxList )								
{																					
List_t * const pxConstList = ( pxList );
	/* Increment the index to the next item and return the item, ensuring */
	/* we don't return the marker used at the end of the list.  */
	( pxConstList )->pxIndex = ( pxConstList )->pxIndex->pxNext;
	if( ( void * ) ( pxConstList )->pxIndex == ( void * ) &( ( pxConstList )->xListEnd ) )
	{
		( pxConstList )->pxIndex = ( pxConstList )->pxIndex->pxNext;
	}
	for (int ix = 2000; ix; ix--) {};
	( pxTCB ) = ( pxConstList )->pxIndex->pvOwner;
}

This just stretches out the time between the adjacent lines of code by a very large amount, so instead of taking several hours to crash it does so in seconds.

My fix for the moment is to globally disable interrupts in vTaskSwitchContext before taskSELECT_HIGHEST_PRIORITY_TASK:

        __asm ("CPSID  i");
		taskSELECT_HIGHEST_PRIORITY_TASK();
		traceTASK_SWITCHED_IN();
        __asm ("CPSIE  i");

I’m assuming there must be something wrong with my configuration, or maybe an oversight in this particular port, but I don’t know the FreeRTOS internals well enough to know where that might be. I’ve been through every resource I can find on Cortex M3/4 ports.

Any ideas?

Thanks!

Scott

rtel wrote on Wednesday, October 10, 2018:

Thanks for providing this level of detail. I can try and replicate your
scenario - with the additional delay included - but first a question
about vTaskSwitchContext() getting interrupted by that has a priority
lower than configMAX_SYSCALL_INTERRUPT_PRIORITY as that should not happen.

vTaskSwitchContext() is called from the PendSV handler, at the time of
writing it can be seen on line 455 here:
https://sourceforge.net/p/freertos/code/HEAD/tree/trunk/FreeRTOS/Source/portable/GCC/ARM_CM4F/port.c
On lines 451 and 452 you can see that the basepri register is set to
configMAX_SYSCALL_INTERRUPT_PRIORITY, which you said it 5, which should
then prevent the timer interrupt from executing if it has a priority of
7 (5 being a higher priority than 7 on Cortex-M).

Would you agree with my description here? If so, do you have
configASSERT() defined? Are there any other interrupts executing that
might be the culprit?

n1vg wrote on Wednesday, October 10, 2018:

Hi Richard,

Thanks for the quick reply. My port.c looks different - I believe this one came from Erich Styger’s PEx components at mcuoneclipse.com originally. I know this port has been modified to support Newlib’s malloc() for heap allocation, not sure what else might be different. Here’s the vPortPendSVHandler:

#if (configCOMPILER==configCOMPILER_ARM_GCC)
#if configGDB_HELPER
/* prototypes to avoid compiler warnings */
__attribute__ ((naked)) void vPortPendSVHandler_native(void);
__attribute__ ((naked)) void PendSV_Handler_jumper(void);

__attribute__ ((naked)) void vPortPendSVHandler_native(void) {
#elif !MCUC1_CONFIG_PEX_SDK_USED /* the SDK expects different interrupt handler names */
__attribute__ ((naked)) void PendSV_Handler(void) {
#else
__attribute__ ((naked)) void vPortPendSVHandler(void) {
#endif
#if configCPU_FAMILY_IS_ARM_M4_M7(configCPU_FAMILY) /* Cortex M4 or M7*/
  __asm volatile (
    " mrs r0, psp                \n"
    " ldr  r3, pxCurrentTCBConst \n" /* Get the location of the current TCB. */
    " ldr  r2, [r3]              \n"
#if configCPU_FAMILY_IS_ARM_FPU(configCPU_FAMILY)
    " tst r14, #0x10             \n" /* Is the task using the FPU context?  If so, push high vfp registers. */
    " it eq                      \n"
    " vstmdbeq r0!, {s16-s31}    \n"

    " stmdb r0!, {r4-r11, r14}   \n" /* save remaining core registers */
#else
    " stmdb r0!, {r4-r11}        \n" /* Save the core registers. */
#endif
    " str r0, [r2]               \n" /* Save the new top of stack into the first member of the TCB. */
    " stmdb sp!, {r3, r14}       \n"
    " mov r0, %0                 \n"
    " msr basepri, r0            \n"
    " bl vTaskSwitchContext      \n"
    " mov r0, #0                 \n"
    " msr basepri, r0            \n"
    " ldmia sp!, {r3, r14}       \n"
    " ldr r1, [r3]               \n" /* The first item in pxCurrentTCB is the task top of stack. */
    " ldr r0, [r1]               \n"
#if configCPU_FAMILY_IS_ARM_FPU(configCPU_FAMILY)
    " ldmia r0!, {r4-r11, r14}   \n" /* Pop the core registers */
    " tst r14, #0x10             \n" /* Is the task using the FPU context?  If so, pop the high vfp registers too. */
    " it eq                      \n"
    " vldmiaeq r0!, {s16-s31}    \n"
#else
    " ldmia r0!, {r4-r11}        \n" /* Pop the core registers. */
#endif
    " msr psp, r0                \n"
    " bx r14                     \n"
    "                            \n"
    " .align 2                   \n"
    "pxCurrentTCBConst: .word pxCurrentTCB  \n"
      ::"i"(configMAX_SYSCALL_INTERRUPT_PRIORITY)
  );
#else /* Cortex M0+ */
  __asm volatile (
    " mrs r0, psp                \n"
    "                            \n"
    " ldr r3, pxCurrentTCBConst  \n" /* Get the location of the current TCB. */
    " ldr r2, [r3]               \n"
    "                            \n"
    " sub r0, r0, #32            \n" /* Make space for the remaining low registers. */
    " str r0, [r2]               \n" /* Save the new top of stack. */
    " stmia r0!, {r4-r7}         \n" /* Store the low registers that are not saved automatically. */
    " mov r4, r8                 \n" /* Store the high registers. */
    " mov r5, r9                 \n"
    " mov r6, r10                \n"
    " mov r7, r11                \n"
    " stmia r0!, {r4-r7}         \n"
    "                            \n"
    " push {r3, r14}             \n"
    " cpsid i                    \n"
    " bl vTaskSwitchContext      \n"
    " cpsie i                    \n"
    " pop {r2, r3}               \n" /* lr goes in r3. r2 now holds tcb pointer. */
    "                            \n"
    " ldr r1, [r2]               \n"
    " ldr r0, [r1]               \n" /* The first item in pxCurrentTCB is the task top of stack. */
    " add r0, r0, #16            \n" /* Move to the high registers. */
    " ldmia r0!, {r4-r7}         \n" /* Pop the high registers. */
    " mov r8, r4                 \n"
    " mov r9, r5                 \n"
    " mov r10, r6                \n"
    " mov r11, r7                \n"
    "                            \n"
    " msr psp, r0                \n" /* Remember the new top of stack for the task. */
    "                            \n"
    " sub r0, r0, #32            \n" /* Go back for the low registers that are not automatically restored. */
    " ldmia r0!, {r4-r7}         \n" /* Pop low registers.  */
    "                            \n"
    " bx r3                      \n"
    "                            \n"
    ".align 2                    \n"
    "pxCurrentTCBConst: .word pxCurrentTCB"
  );
#endif
}

My ARM assembly language is really weak - what’s the %0 notation? Looks like it assembled as #0x50, which is correct. I’m assuming that must be an assembly-time substitution but I don’t know what it references.

When I catch the race condition, BASEPRI is 0. I’ve tried to step through the PendSV handler but the debugger seems determined to make sure I don’t get to see it happening - every time, it hangs before it gets to the msr BASEPRI, r0 instruction. The gdb server sits there with high CPU utilization, whether I’m using an LPC-Link2 or a P&E Cyclone ACP.

I’ve attached a screenshot so you can see where it was at. r0 is set to 0x50, so that’s right. There are other interrupts enabled, but I’ve set all of the NVIC priority registers to values in the range of 0x50 to 0x80 whether the interrupt is implemented or not, just in case. I once got the debugger to stop in vPortSwitchContext and saw BASEPRI set, but when it stepped over the stack check it came up 0 again. It also stopped reading registers entirely right after that so it’s hard to say what was actually going on.

This is where I’d normally insert a rant about the unreliability of NXP’s debugging tools, but I’ve got at least a partial workaround for the bug that’s been plauging this project for weeks, we just started shipping another new product today, and I had a really great first date last night, so I’m going to quit while I’m still in a good mood.

I’ll see what I can find out from Erich about the origin of this port.c, and fuss with the debugger some more tomorrow.

Thanks again,

Scott

rtel wrote on Wednesday, October 10, 2018:

Here’s the vPortPendSVHandler:

Although different, the assembly instructions that set the basepri
register are still in place.

My ARM assembly language is really weak - what’s the %0 notation? Looks
like it assembled as #0x50, which is correct. I’m assuming that must be
an assembly-time substitution but I don’t know what it references.

It is referencing the following constant:

   ::"i"(configMAX_SYSCALL_INTERRUPT_PRIORITY)

When I catch the race condition, BASEPRI is 0. I’ve tried to step
through the PendSV handler but the debugger seems determined to make
sure I don’t get to see it happening - every time, it hangs before it
gets to the msr BASEPRI, r0 instruction. The gdb server sits there with
high CPU utilization, whether I’m using an LPC-Link2 or a P&E Cyclone ACP.

Do you set the debugger to step assembly instructions rather than C
instructions before stepping through? As this is inline assembly you
would need to do that, then open the assembly window in the debugger and
step through the code there.

That function should never be executing with basepri set to 0 - so that
is the curiosity here. Is the function called from anywhere else other
than the function that starts the scheduler in the first place? Are you
100% sure there are no other interrupts running? Did you ensure
configASSERT() was defined - that should catch misconfigurations -
assuming that part of the code has not been changed. You could try
running the official port to see if that has the same issues.

I’ve attached a screenshot so you can see where it was at. r0 is set to
0x50, so that’s right. There are other interrupts enabled, but I’ve set
all of the NVIC priority registers to values in the range of 0x50 to
0x80 whether the interrupt is implemented or not, just in case. I once
got the debugger to stop in vPortSwitchContext and saw BASEPRI set, but
when it stepped over the stack check it came up 0 again. It also stopped
reading registers entirely right after that so it’s hard to say what was
actually going on.

Maybe you are not seeing what you think you are seeing here. Are you
100% sure the issue is occurring inside the tick interrupt? Could the
issue actually be being caused elsewhere in the system, and only become
evident in the tick interrupt?

xz8987f wrote on Wednesday, October 10, 2018:

Richard, Scott,
I’m going to check this one one my side too. That tick handler is a combination of multiple ports to support many architectures. I’ll check if I have missed anything.

Erich

n1vg wrote on Wednesday, October 10, 2018:

Yes, the debugger was set to instruction stepping mode. And no, nothing else calls vTaskSwitchContext. Here’s the sequence of functions from the instruction trace I captured during one crash:

vQueueWaitForMessageRestricted
prvProcessTimerOrBlockTask
xTaskResumeAll
vPortEnterCritical
xTaskResumeAll
vPortExitCritical
xTaskResumeAll
prvProcessTimerOrBlockTask
vPortYieldFromISR
vPortPendSVHandler
vTaskSwitchContext
xTaskGetTickCountFromISR
vPortValidateInterruptPriority
xTaskGetTickCountFromISR
vTaskSwitchContext
ftm0_irq
vTaskSwitchContext
(crashes because pxCurrentTCB is now null)

configASSERT is defined. I modified it yesterday to add a bkpt instruction to the infinite loop so it’d stop the debugger immediately, but it definitely works, and catches when I screw up interrupt priorities.

Are you 100% sure there are no other interrupts running?

There are 18 peripheral interrupts enabled, in addition to the core interrupts like SysTick and the fault handlers. All of the unused vectors point to a halt function, and all of them, used or not, have priority levels set to be safe with FreeRTOS. There may be other interrupts pending (if I pause the debugger there I get an EWM watchdog interrupt at least) but nothing that should be able to interrupt the kernel. It was definitely the FTM IRQ that caused it in the trace above, and every trace I’ve caught, and the ISR also turns on an LED that gets turned off in a few ticks whenever it generates a task notification. The LED was stuck on every time I checked during a lockup so the FTM ISR had definitely done a task notification very recently every time it locked up.

Are you 100% sure the issue is occurring inside the tick interrupt?

Which tick interrupt are you talking about? The FTM IRQ, or the SysTick that’s presumably kicking off the context switch that’s getting interrupted? I am certain that it’s the FTM IRQ (but not necessarily exclusively that one) that’s interrupting the context switch.

There are UART DMA and ADC DMA interrupts also firing more or less frequently depending on what’s going on (for WiFi data and audio samples) and each of those can wake up their respective tasks (FTM sends notifications to the main task) and I haven’t seen them cause this problem, but that could be because of their relative frequencies. Also, the FTM ISR is processing data from a GPS receiver that’s aligned to the start of the GPS second. The SysTick rate is an even 300 Hz and the clock oscillator is going to be off by some ppm relative to GPS, so presumably the start of the NMEA data (and the resulting task notifications) will be slowly slipping relative to the SysTick tick. If there’s an opening for a race condition, a slow slip like that is the way to find it. The WiFi DMA will be very bursty and dependent on traffic (I’ve had four browser clients running to crank up the traffic level) but much less consistent with timing, and the ADC DMA rate divides evenly into the SysTick rate, with both driven by the same clock, so it’s not likely to trigger a race condition.

I’m going to work on the debugger today and see if I can get it to step through the PendSV code.

Thanks,

Scott

Edit: Now that I think about it, the GPS time slip shouldn’t be the issue. The FTM overflow still fires at a consistent rate, it’s just the task notification that changes, and for testing I ignored the data entirely and had the notification happen every overflow and it still caused the lockup.

n1vg wrote on Thursday, October 11, 2018:

Ok, digging through the instruction trace and I think I’ve got something. You guys are going to have to help me understand why this is.

I’ve got configGENERATE_RUN_TIME_STATS enabled, so vTaskSwitchContext() calls xTaskGetTickCountFromISR(), which looks like this:

	uxSavedInterruptStatus = portTICK_TYPE_SET_INTERRUPT_MASK_FROM_ISR();
	{
		xReturn = xTickCount;
	}
	portTICK_TYPE_CLEAR_INTERRUPT_MASK_FROM_ISR( uxSavedInterruptStatus );

This is where it’s going off the rails - BASEPRI is being set to 0 by that last line. Here are the macros:

#define portTICK_TYPE_SET_INTERRUPT_MASK_FROM_ISR() portSET_INTERRUPT_MASK_FROM_ISR()
#define portTICK_TYPE_CLEAR_INTERRUPT_MASK_FROM_ISR( x ) portCLEAR_INTERRUPT_MASK_FROM_ISR( ( x ) )

#define portSET_INTERRUPT_MASK_FROM_ISR()     0;portSET_INTERRUPT_MASK()
#define portCLEAR_INTERRUPT_MASK_FROM_ISR(x)  portCLEAR_INTERRUPT_MASK();(void)x

So for some reason portSET_INTERRUPT_MASK_FROM_ISR is defined to always return 0, and portCLEAR_INTERRUPT_MASK_FROM_ISR just ignores the mask it’s given.

This was clearly someone’s conscious decision. I can’t tell what it was trying to accomplish, or if it was just something that was set for testing and overlooked.

Richard, looks like the problem is not on your end. Erich, do you know where this came from?

Since I’m not using Processor Expert anymore I suppose I should probably move to the official current port. My big hangup there is that I’m going to have to take another look at my heap usage. I make sparing use of the heap in application code and I don’t like having two different heap mechanisms, so I’ve been using the Newlib malloc option that I don’t think is part of the standard release. I’m going to have to go refresh my memory on heap options.

Thanks,

Scott

rtel wrote on Thursday, October 11, 2018:

This is actually the standard behaviour - so in the mainline code, not
just Erich’s, although I’m not sure what the
portTICK_TYPE_CLEAR_INTERRUPT_MASK_FROM_ISR() macro is and its
implementation does not appear to do anything other than call
portCLEAR_INTERRUPT_MASK_FROM_ISR() (which is our macro).

The critical sections used to st the interrupt mask back to the previous
mask value, but I was informed by architecture experts that that was not
necessary, and that internally the silicon would latch to the value of
the current interrupt anyway, so just setting to 0 was the most
efficient thing to do on exiting the critical section. Have you tried
updating the portSET/CLEAR_INTERRUPT_MASK_FROM_ISR() macros to actually
return the current basepri value rather than just hard code to return 0?

n1vg wrote on Thursday, October 11, 2018:

portTICK_TYPE_CLEAR_INTERRUPT_MASK_FROM_ISR() seems to be part of the mainline code too. I just grabbed 10.1.1 yesterday. Looks like the portTICK_TYPE wrappers are there to handle atomic or non-atomic tick reads depending on the port. If it’s atomic, the macros do nothing.

and that internally the silicon would latch to the value of the current interrupt anyway

Wouldn’t that still be a problem here? If I understand it correctly, vTaskSwitchContext() is running from the PendSV interrupt, which is set to the lowest priority - in this case 15. If the interrupt priority is what’s latched, then when the tick read critical section within vTaskSwitchContext() completes it’s effectively setting BASEPRI to 15 when it entered at 5, and now the FTM IRQ at priority 7 is able to interrupt what should be a critical section in taskSELECT_HIGHEST_PRIORITY_TASK().

Next question: Why would reads of tick type variables not be atomic on a Cortex M4? I think maybe that’s where our problems start - here’s the mainline portmacro.h:

	/* 32-bit tick type on a 32-bit architecture, so reads of the tick count do
	not need to be guarded with a critical section. */
	#define portTICK_TYPE_IS_ATOMIC 1

That is not defined in Erich’s portmacro.h. That might explain why this hasn’t been a problem for more people - to trigger this bug you have to have portTICK_TYPE_IS_ATOMIC=0 and have runtime statistics turned on.

Setting portTICK_TYPE_IS_ATOMIC looks like it will let me remove my global interrupt mask, but not knowing the kernel better I’d worry that there might be other places where BASEPRI could get set wrong and it doesn’t feel like a solid fix to me.

Have you tried updating the portSET/CLEAR_INTERRUPT_MASK_FROM_ISR() macros to actually return the current basepri value rather than just hard code to return 0?

I’m not conversant enough in ARM/Thumb assembly to even read more than 70% of what’s there without a reference manual in front of me and I haven’t used gcc’s extended asm syntax before. I’m sure I could get it but I’d feel better if someone who knows what they’re doing showed me the right way, so I don’t wind up chasing down side effects. And if anyone knows a good, complete, concise reference/cheat sheet for the instruction set and assembly syntax, please let me know!

Thanks,

Scott

rtel wrote on Thursday, October 11, 2018:

So the scenario is like this:

  1. PendSV executes.
  2. Hardware sets priority in the NVIC to 15 (low priority used by
    interrupt handler).
  3. PendSV software sets basepri to 5 before calling vTaskSwitchContext().
  4. vTaskSwitchContext() calls the stats function, which calls
    portSET_INTERRUPT_MASK_FROM_ISR(), which also sets BasePri to 5 (which
    it already was anyway).
  5. When portCLEAR_INTERRUPT_MASK_FROM_ISR() is called software sets
    BasePri to 0 (unmasks everything) BUT actually basepri used within the
    hardware itself is the priority of the interrupt - which is 15 - and the
    critical section has been broken.

Makes sense.

n1vg wrote on Friday, October 12, 2018:

Yes, that’s my interpretation. The lack of a portTICK_TYPE_IS_ATOMIC setting seems to be an oversight in Erich’s port, but it ought to be a harmless one that just costs a few wasted cycles per context switch.

I have a few different ways I can avoid this for now. Setting it up so that it does retain the BASEPRI setting seems to me like the way to make it behave in the least-surprising fashion.

If that’s not where the fix is made, then the big question is whether there’s anywhere else that this is going to pop up.

The most surprising thing about this is that it apparently wasn’t a bug in my application code. Usually the rule is that the longer you spend hunting a bug, the more likely it is to be something absolutely trivial that someone else is going to spot the first time they take a look over your shoulder.

Now if SiLabs could just make their WiFi modules not randomly reboot, I could have a pretty solid product…

n1vg wrote on Friday, October 12, 2018:

Richard, do you need me to submit a bug report or anything? I have what I need for now, I just want to make sure this doesn’t bite anyone else.

rtel wrote on Friday, October 12, 2018:

I have it noted on the backlog, but probably best if you open a bug
ticket in SourceForge and point it to this thread so it doesn’t get lost
and others have visibility too - thanks.

n1vg wrote on Friday, October 12, 2018:

Will do!