CortexM3 FreeRTOS Oddity

sniperears wrote on Thursday, March 07, 2013:

As I had the pleasure of fighting this recently, I thought I’d post it in case it might help someone else:

The CM3 port implementation for Yield(FromISR) is as follows:

void vPortYieldFromISR( void )
{
	/* Set a PendSV to request a context switch. */
	*(portNVIC_INT_CTRL) = portNVIC_PENDSVSET;
}

The danger here is that let’s say you have a task that is only run once (init or something) and then deletes itself.  That task might look something like:

static void initTask( void * pvParameters )
{
   // Do important work
   ...
   // work is all done, so time to go away
   vTaskDelete(NULL);
}

If your compiler has optimizations turned on, it would probably call vTaskDelete with the Branch instruction instead of the BranchAndLink instruction (because in that stack frame there is nothing to return to - BTW this happens in IAR with optimizations set to medium or higher). 
Since the task is being deleted, the scheduler needs to do a context switch instead of return and therefore uses the vPortYieldFromISR Macro.  This is all well and good except for the fact that setting the PendSV bit doesn’t actually cause the interrupt, it just pends it and the NVIC decides when it’s appropriate.  This means that a few more instructions could get executed before the context switch happens.  Well, the next instruction after setting the PendSV bit is to return (branch to link register) - but the link register wasn’t set in this case, because the optimizer knew there was nowhere to return to.  This causes a fault.  If there were just one more instruction after the vTaskDelete, the BL instruction is used, and the fault is avoided even though the code is never executed. Also, if optimizations are set to low or none (at least in IAR) the BL instruction is used regardless.

rtel wrote on Friday, March 08, 2013:

Thanks for posting this, it is very interesting, and I am surprised this could happen.  I would have to check how many instructions could execute after a PendSV was implemented, I would not have thought there would be any over and above completing any atomic instructions used in the code generated by the compiler to set the PendSV bit.

In the case you highlight, is the vPortYieldFromISR() function being inlined?  Otherwise there would be instructions to return from vPortYieldFromISR() between the PendSV bit being set and vTaskDelete() reaching the end of its function.

Regards.

sniperears wrote on Friday, March 08, 2013:

Thanks for the reply Richard,
I too was surprised (and quite baffled for a while) by this.  To answer your question, Yes.  This is what the end of vTaskDelete looks like:

		...
		taskEXIT_CRITICAL();
		/* Force a reschedule if we have just deleted the current task. */
		if( xSchedulerRunning != pdFALSE )
		{
			if( ( void * ) xTaskToDelete == NULL )
			{
				portYIELD_WITHIN_API();
			}
		}
	}

Here is the chain of defines (aggregated for ease of readability):

#define portYIELD_WITHIN_API   portYIELD      // Scheduler layer
#define portYIELD()   vPortYieldFromISR()         // Port layer

So in the end it vTaskDelete() looks like this:

		...
		taskEXIT_CRITICAL();
		/* Force a reschedule if we have just deleted the current task. */
		if( xSchedulerRunning != pdFALSE )
		{
			if( ( void * ) xTaskToDelete == NULL )
			{
				portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;
			}
		}
	}

And so after the PendSV bit is set - Branch to Link with link register being 0 is attempted (and faults).

Cheers

davedoors wrote on Friday, March 08, 2013:

Putting to one side for a minute whether something in FreeRTOS needs to change if the setting of pendSV is inlined, does this example not also show a compiler bug? You say the compiler changes the branch instruction because it is not going to return, but then evidently it does try and return. Either the branch without link should jump to code that is completely self contained and never try to get back to where it jumped from or the branch with link should be used. The compiler does not know about any changes in execution paths that are brought about by FreeRTOS (context switches to other tasks), it only knows about execution paths expected by single threading C code.

sniperears wrote on Friday, March 08, 2013:

After re-reading my response I realize I was quite unclear.  No, vPortYieldFromISR() is **not **inlined -  it is implemented in port.c, but yes, the compiler is smart enough to call it with Branch instead of BranchAndLink. 
Hope that’s more clear.

sniperears wrote on Friday, March 08, 2013:

Good point Dave.  I agree.  Nevertheless, the optimizations really do produce this phenomenon - maybe it just does a blind jump hoping that execution path will change and expects a fault if it does not?  Isn’t it good when the compilers have hope? :slight_smile:

Just for the fun of it I did a screen capture of the disassembly (mixed with C) where this problem exists.

The block of code that follows the vTaskDelete is a vApplicationIdleHook which increments a count and calls SysCtlSleep().  Notice the instruction used to call SysCtlSleep().  A blind Branch.  And yet the implementation of SysCtlSleep is an assembly instruction WFI followed by a } which is a BX  LR.

SysCtlSleep:
       0xce46: 0xf004 0xbdbd  B.W       CPUwfi                  ; 0x119c4
CPUwfi:
      0x119c4: 0xbf30         WFI
}
      0x119c6: 0x4770         BX        LR

So apparently this happens everywhere.

sniperears wrote on Friday, March 08, 2013:

Image is here - how come you can’t edit here?

richard_damon wrote on Friday, March 08, 2013:

What the compiler is doing is a form of “Tail-Recursion” optimization where you can convert a call followed by a return with a jump (so when the function called does its return, it returns to the caller of the 1st function).

The problem here is that task function do not have a proper stack frame set up for returning, so when the vTaskDelete function executes its return, it  heads the processor into invalid space.

What is probably the best option would be to protect against this in vTaskDelete by putting some non-optimizable away operation after the Yield to make sure this doesn’t happen. Another would be to add a nop in the Yield operation in the port layer, but this adds a cycle to all yields.

rtel wrote on Friday, March 08, 2013:

Thank you for your insights richard_damen.  In other ports, where it is known that a software interrupt takes a few instructions to execute, or where branch delay slots are required (because the CPU executes the next instruction that is already loaded into the pipeline), I have added NOPs after the yield, or more recently read back from the software interrupt register to force the microcontroller to stall until the value is first written.

I will review this further in the documentation, and if necessary contact some of the ARM guys for their opinion, before deciding what to do, as I’m surprised it is necessary in this case.  I will let you know…

Regards.

richard_damon wrote on Friday, March 08, 2013:

If there is only 1 instruction of delay, I would think that guarding vTaskDelete would make more sense, as executing the return instruction should normally be harmless. If the processor might execute farther than that, then the Yield definitely needs to handle the delay.

The cause of the problem is that multiple levels of code have preformed tail optimization, so that the Yield is attempting to directly return to the caller of the task function. The stack frame for the task function was not set up to handle a return, so executing the return cause a fault. This is where adding some code after the Yield in vTaskDelete would help, as then VTaskDelete could not use tail optimization, and Yield would have a valid stack frame to return from.

This is NOT a compiler bug, as the compiler expects that ALL user functions have proper stack frames to return from, so tail optimization is valid. The other solution would be to initialize the stack frame for a task so that doing a return would be valid, perhaps going to a stub that calls vTaskDelete(0).

rtel wrote on Saturday, March 09, 2013:

Its seems that worst case up to two instructions past the PendSV setting can be set (depending on lots of things including wait states, etc.), and that adding a memory barrier after writing to the PendSV bit would fix this particular case.  The question then is, is it better to take the hit of a memory barrier instruction on every yield to fix a corner case when deleting a task.  I’m still researching this and haven’t concluded anything yet.

Regards.

mahavir wrote on Sunday, March 10, 2013:

Interesting discussion!

I think, I had seen this issue previously but was not aware (or reached to conclusion) that it was “tail call” optimization. Interestingly this was not reproducible with all gcc (Codesourcery arm-none-eabi-gcc) versions and hence was difficult to chase down. Neverthless, as Richard already mentioned we ended up fixing like below,


void vPortYieldFromISR( void )
{
        /* Set a PendSV to request a context switch. */
        *(portNVIC_INT_CTRL) = portNVIC_PENDSVSET;
        __asm volatile (“dsb”);
}

But I have couple of questions:

1. Is above fix the correct one, or rather optimum one?
2. What would be worst case impact (if any) on performance of system?

Thanks.

rtel wrote on Monday, March 11, 2013:

1. Is above fix the correct one, or rather optimum one?

Yes - that will fix the issue reported - but I’m not sure yet whether just adding two NOPs in place of the dsb would actually prove more optimal.

Regards.

rtel wrote on Tuesday, March 12, 2013:

I will probably add a DSB followed by an ISB into the yield function.  This is apparently very much overkill, but necessary to be within the specification for the Cortex-M architecture for all current and future parts (as some of the characteristics that dictate these things are implementation defined).

Regards.