Context/scheduler running prior to OS running

neilbradley wrote on Thursday, November 23, 2006:

I had a really annoying problem with FreeRTOS and my SAM7-256 board that I believe is a kernel bug. The issue I was having is that I needed system interrupts enabled before I was running the scheduler (to talk to various pieces of hardware), and I was occasionally getting bizarre register corruptions (value of 0xa5a5a5a5) after the first task was created. So about one out of every 10 times I turned on (or reset) the board, I’d get a data abort.

I traced the problem to vTaskIncrementTick and vTaskSwitchContext doing regular "processing" of tasks and switching out my existing pre-OS context while timer interrupts were enabled. vTaskSwitchContext checks for uxSchedulerSuspended, which by default is *FALSE*, causing it to (potentially) do a task swap outside of the OS running. Yikes!

My belief is that vTaskIncrementTick and vTaskSwitchContext should do absolutely nothing unless xSchedulerRunning is set to something other than pdFALSE. I added in:

    if (pdFALSE == xSchedulerRunning)

in vTaskIncrementTick and vTaskSwitchContext, and my bizarre startup problems have gone away. Currently those procedures don’t look at the xSchedulerRunning boolean. My initial thought was that this was something I should put in the port side of things, but upon further reflection, it looks to me as if having those procedures run on ANY platform when the OS is not running is a bad idea.

Richard, care to weigh in on this one and perhaps fix?


rtel wrote on Thursday, November 23, 2006:

Hi Neil,

This is one of those ones that has been bounced back and forth a bit - and I think I’m getting to be on the losing side of the argument.

The reason the critical nesting count is initialised to a non-zero value is to prevent interrupts accidentally being enabled while API calls are being made prior to the scheduler being started, as on some architectures this is a bad thing.

The tick interrupt itself is not configured until the scheduler is started - and interrupts are disabled immediately prior to the tick being configured.  Therefore the idea is that tick interrupts are not processed until the first task starts (the task is started with interrupts enabled in the task context).

Problems can occur with other interrupts if prior to starting the scheduler you configure a peripheral - say USB for example - and that interrupt can cause a context switch.  One solution is to not configure the interrupts until the scheduler is started, but this is a long way from satisfactory.  It is easier on architectures where you can change the vector table at run time.  There you can use an ISR that does not cause a context switch prior to the scheduler being started, then have the task that uses the interrupt install a different ISR when the task first starts.

Personally I don’t like the idea of checking that the scheduler is running in each tick - it’s more overhead.  Also I’m not sure that it actually fixes the problem because, by the time the increment tick and switch context C functions have been called, the task context will already have been saved.  The problem with this is that if the scheduler has not been started, or at least no tasks have been created, then the scheduler data structures will not have been initialised and the task context can be saved to a NULL address (there is no legitimate place to save the context).  pxCurrentTCB could still be NULL.

It would be possible to create a default TCB and stack (maybe create the idle task first thing rather than last thing and use that) so the context would have somewhere to go, and then initialise uxSchedulerSuspended to a non-zero value.  This means the context would harmlessly be saved and restored to the area that would otherwise be used by the idle task.  When starting the scheduler, once interrupts were disabled, uxSchedulerSuspended could be reset to 0.


nobody wrote on Thursday, November 23, 2006:

Why not put a check in the peripheral ISR?

portEXIT_SWITCHING_ISR( ( xTaskWokenByTx || xTaskWokenByRx ) && xSchedulerRunning );

This will prevent the context being saved if the scheduler is not running.

neilbradley wrote on Thursday, November 23, 2006:

In my environment, I need to run with interrupts enabled, and unfortunately that timer is something I need to use (and I need interrupts enabled for that timer).

Regarding the “overhead” comment - um, it’s checking a single global variable. Even on 12-cycle 8051s, this is an insignificant amount of time. I respect your desire to keep things minimal and lightweight and fast, but come on! Checking a single variable every 5 milliseconds (or even every millisecond) is a negligible amount of time and won’t make any discernible difference.

Another problem I’m running into is that I need to write routines that are “special” - that is, I need to see if the scheduler is running before calling the operating system and either call semaphore gets/puts, or not (I SOMETIMES get data aborts right now if I call the queue receive routine when the scheduler isn’t running). Either those checks for the scheduler are going to have to be in the client code, or in the OS.

Yes, there are many other ways that I can solve this, but this is a straightforward, “one size fits all and saves everyone from the hassle” approach. This is a common behavior in all microkernels I’ve worked with - the function will return gracefully if the OS isn’t running, otherwise do the expected function.

Why not have the kernel handle this? Sorry to be a bit of a whiner, but I’ve burned from 1PM today to just under 1AM, almost nonstop, trying to figure out where my random data aborts, trashed stack, and otherwise VERY unpredictable timing dependent problems really are.


neilbradley wrote on Thursday, November 23, 2006:

Forgot to mention, I don’t mind if anything gets put on the stack because of the tick - that’s harmless. The problem is that FreeRTOS was attempting to do scheduling, and the check DEFINITELY has made a difference. Went from being intermittently unusable to (finally) reliably bootable.

rtel wrote on Thursday, November 23, 2006:

It’s not the saving of the context that is the problem - but the access to a TCB. 

During the context switch the new top of stack is written to the current TCB (pointed to by pxCurrentTCB).  The question is what value does pxCurrentTCB hold? 

If no tasks have been created it will be NULL.  An obviously bad situation.

If one or more tasks have been created (but the scheduler not necessarily started) then it will point to the TCB of the highest priority task created thus far.  In this case (I think, I would have to draw out the scenario to be sure) the context will be saved to your C stack harmlessly, but the top of the stack will be written to the TCB, effectively corrupting it.  When the task starts it will read the wrong top of stack from the TCB, pop junk into its registers, and then run using your C stack instead of the stack allocated to it.  This may or may not result in an error depending on how your system uses RAM.  If you are passing parameters to the task then the parameter pointer will definitely be corrupt.  The problem is this occurs before the extra test you placed in the C code.