Memory barriers in FreeRTOS

orifai01 wrote on Wednesday, February 22, 2017:

Are there equivalent functions to Linux wmb() and rmb() in FreeRTOS?

Thank you

richarddamon wrote on Wednesday, February 22, 2017:

FreeRTOS is built with an assumption of a single-processor model (so that disabling interrupts provides proper mutual exclusion for system structures), and so doesn’t have a need at the software level for such a primitive.

There still might be a need at the hardware level, but its operation will be hardware dependent and OS agnostic, so it will be up to the compiler to provide such a primitive when it is needed.

For many of the processors that FreeRTOS targets, there really is no need for such a thing, as they don’t have a cache sophisticated enough to delay writes, and thus no need for the barrier.

rtel wrote on Wednesday, February 22, 2017:

In addition to Mr Damon’s reply: FreeRTOS does use memory barriers
internally, where necessary (for example after writing to hardware
registers to enter sleep mode, etc.), but that is all.

davidbrown wrote on Wednesday, February 22, 2017:

The rmb() and wmb() functions (macros, actually) in Linux are essential in single processor systems too. For simple enough systems, they are defined as:

asm volatile("" ::: "memory");

What this means is that any reads or writes of memory will not be moved across this memory barrier, nor will the results of such reads or writes be cached across the barrier. You typically don’t need such barriers in normal C, but if you need control of when data is read or written with respect to other accesses (interrupts, pre-emptive scheduled threads, DMA, etc.) then such barriers are a simple and cheap way to enforce ordering. It is an alternative to making all the memory accesses volatile.
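As a concrete sketch of the idea (the names here are invented for illustration): only the flag is declared volatile, and the barrier stops the compiler from moving the buffer write past the flag write, or keeping it in a register.

```c
#include <stdint.h>

/* Compiler-only memory barrier, as in the Linux macros above (gcc/clang). */
#define barrier() __asm__ volatile("" ::: "memory")

static uint32_t tx_buffer[4];           /* plain, non-volatile memory   */
static volatile uint32_t tx_ready = 0;  /* polled by an ISR or DMA unit */

void send(uint32_t value)
{
    tx_buffer[0] = value; /* without the barrier, this write could be    */
                          /* reordered past the flag write, or held in a */
                          /* register until after tx_ready is set        */
    barrier();
    tx_ready = 1;         /* consumer may act on the buffer from here on */
}
```

Note that this orders only the compiler's output; on a CPU with a write buffer or cache a hardware barrier may still be needed on top.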

I expect that many FreeRTOS functions include memory barriers already.

richarddamon wrote on Thursday, February 23, 2017:

Your case for the single processor isn’t an OS issue, but a compiler optimization issue. An RTOS will not change the order of execution of actions within a task/thread.

There is NOTHING for the OS to do about this in this case.

davidbrown wrote on Thursday, February 23, 2017:

That is true, but it is still a reasonable question to ask - since other OSes provide such macros. The OS needs such compiler barriers in its implementation, and already has to have some sort of abstraction for them since the details vary according to compiler - it is perfectly natural to think that the same functionality could be exposed to the user.

Actually, having had a quick look at the port.c and port.h code for critical sections in the ARM_CM4F port (as an example), I can see that the code does not have the correct barriers where it needs them. It relies on using vPortEnterCritical and vPortExitCritical as functions, which will work as a compiler memory barrier as long as the compiler cannot see those function definitions when calling them. If the code is compiled with link-time optimisation, that changes - and the compiler can move memory accesses around calls to vPortEnterCritical and vPortExitCritical.

rtel wrote on Thursday, February 23, 2017:

Actually, having had a quick look at the port.c and port.h code for
critical sections in the ARM_CM4F port (as an example), I can see that
the code does not have the correct barriers where it needs them.

If you think there is an error somewhere, please be specific as to
where, so it can be discussed, and corrected if necessary.

The code used to mask interrupts in the M4 port is shown below. It
includes memory barriers. Please let me know your thoughts on why this
is incorrect:

portFORCE_INLINE static void vPortRaiseBASEPRI( void )
{
uint32_t ulNewBASEPRI;

   __asm volatile
   (
     "  mov %0, %1            \n"  \
     "  msr basepri, %0       \n" \
     "  isb                   \n" \
     "  dsb                   \n" \
     :"=r" (ulNewBASEPRI) : "i" ( configMAX_SYSCALL_INTERRUPT_PRIORITY )
   );
}

davidbrown wrote on Monday, February 27, 2017:


It is possible that this has drifted a bit from the original poster’s question - if you would prefer to discuss this by ordinary email, that’s fine by me. It might also be worth asking on the gcc-help mailing list to get the opinions of the actual compiler developers here.

My concern here rests on three points:

  1. In C, sequence points and other ordering is only “as if” - the generated code must act “as if” it followed the C rules of sequencing, with respect to “observable behaviour”. “Observable behaviour” includes volatile memory accesses, file I/O, other <stdio.h> I/O, and program start/stop. Calling a function whose source and implementation are not known when compiling a file effectively acts as “observable behaviour” because the compiler does not know if the function has any /real/ observable behaviour. And some compiler extensions, such as “volatile” inline assembly in gcc, are also considered “observable behaviour”.

  2. Ordering of other aspects of the language is not required to follow the ordering of the abstract machine. In particular, any reads or writes to memory can be re-arranged with respect to volatile memory accesses - it is only the volatile accesses that are ordered.

  3. When using more advanced compilers and optimisations, the compiler can use knowledge of called functions to re-arrange code. In particular, it can use link-time optimisation to see across different translation units.

The compiler does not understand the contents of assembly statements. It uses the output, input, and clobber sections, along with the “volatile” keyword, to learn about them. When you have inline assembly such as the given vPortRaiseBASEPRI function, the “volatile” forces the assembly to be ordered with respect to other volatile accesses. (The input and output sections can also enforce certain ordering.) But it does /not/ order the code with respect to ordinary memory accesses or other code.

A user may have code like this:

uint64_t x;
uint64_t y;

// Nothing else can interrupt this,
// so the 64-bit accesses will be atomic
x = y;       

This is a natural interpretation of entering or exiting a critical code region. But the code does /not/ guarantee it. We can simplify taskENTER_CRITICAL as:

asm volatile...
uxCriticalNesting++;

and taskEXIT_CRITICAL as:

uxCriticalNesting--;
if (uxCriticalNesting == 0) {
    asm volatile...
}

From the definitions of the volatile assembly, the /only/ things the compiler sees as important for order is the ordering with respect to other volatile accesses, and of course it cannot change the order of instructions within the asm volatile statements themselves.

So the user code is actually:

asm volatile... // enter
x = y;
if (uxCriticalNesting == 0) {
    asm volatile... // exit
}

The compiler can re-arrange /any/ of this with respect to the volatile assembly. It can start by incrementing uxCriticalNesting, then decrement it (it can also omit this entirely). Then it might read the low half of y and write it to the low half of x. Perhaps then it executes code like:

if (uxCriticalNesting) {
    asm volatile... // enter
} else {
    asm volatile... // enter
    asm volatile... // exit
}

And then it will copy the top half of y into the top half of x.

I think the user would be rather surprised to see this happen - but it is all legal for the compiler.

It is even worse for things like taskYIELD - the user would expect normal writes before a taskYIELD to be completed before a task switch!

Now, it is fair to say that the FreeRTOS code has worked fine so far - I doubt if you have seen real cases of this kind of re-ordering. There are two things that save you here - one is that the compiler cannot normally see “inside” function calls like taskYIELD when they are defined in a different file, and the other is that compilers don’t re-arrange code awkwardly just for fun - they only do it if there is a performance gain to be had. But with link-time optimisation making large pieces of code look like a giant inlined function, and processors with lots of registers to keep local data around instead of writing it out to memory - you /will/ see problems. And like all subtle problems in multi-tasking systems, they are going to be seriously unpleasant to find because code will work as expected in most cases.

If you believe me that this is potentially a real problem, even on single cpu systems, then thankfully the fix is extremely simple (in gcc and clang at least - I can’t answer for other compilers). Just add a “memory” to the clobber list of your inline assembly, or add explicit barriers:

#define barrier() asm volatile ("" ::: "memory")

The "memory" clobber tells gcc that this assembly may read or write memory in ways that are not described by the input and output operands of the asm statement.
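Applied to the vPortRaiseBASEPRI function quoted earlier, the fix is a one-line change: a third colon section adds "memory" to the clobber list. (A sketch only - this is target-specific ARM Cortex-M code and will not compile for other targets.)

```c
portFORCE_INLINE static void vPortRaiseBASEPRI( void )
{
uint32_t ulNewBASEPRI;

   __asm volatile
   (
     "  mov %0, %1            \n"  \
     "  msr basepri, %0       \n" \
     "  isb                   \n" \
     "  dsb                   \n" \
     :"=r" (ulNewBASEPRI) : "i" ( configMAX_SYSCALL_INTERRUPT_PRIORITY ) : "memory"
   );
}
```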

Be generous about adding such barrier() statements in your code. It will keep things a lot safer, and mean that common users’ assumptions about things like critical sections will be correct.

(C11 gives other ways to make such fences or enforce such ordering, but that won’t help the FreeRTOS code much!)
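For reference, the C11 equivalent of such a compiler-only fence looks like the sketch below (names invented for illustration). It assumes a C11 toolchain, which, as noted, FreeRTOS cannot assume in general.

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t shared_value;
static volatile int value_ready;

void publish(uint32_t v)
{
    shared_value = v;
    /* Compiler-only fence: orders surrounding accesses in the generated
     * code without emitting a CPU barrier instruction (contrast with
     * atomic_thread_fence, which may emit hardware barriers). */
    atomic_signal_fence(memory_order_seq_cst);
    value_ready = 1;
}
```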



rtel wrote on Monday, February 27, 2017:

Hi David - thanks for taking the time to provide this analysis. There
is a bit much there for a quick response but I am digesting the info.

glenenglish wrote on Tuesday, April 18, 2017:

David, good stuff.
Has there been any more discussion on this before I do a full code review ?

I avoid -O3 like the plague, and only use it for spot functions. -O3 broke my FreeRTOS devices; at the time I just guessed an excessively aggressive compiler was generating WILDLY unexpected reordering, and stayed with -O2. I’ll have to investigate more when I have time one day…

rtel wrote on Wednesday, April 19, 2017:

The head revision in SVN has added ::: "memory" to many additional places.

davidbrown wrote on Wednesday, April 19, 2017:

I haven’t had a look at the code for a while, but if my post about memory barriers was helpful, then I’m happy.

In theory, barring compiler bugs (which should be rare, but do exist), correct code will not be broken (except possibly timing or space requirements) by changing optimisation options. In practice, I have seen it happen a great many times. Sometimes this is simply due to programmers not understanding C properly. But sometimes it is due to the limitations of C in expressing the needs of the programmer. And sometimes it is due to the type of code in question being very difficult to get right.

In the case of FreeRTOS, I guess it is a combination of the second two reasons. You can’t write an RTOS in pure C - it needs some assembly, and it needs implementation-dependent behaviour and compiler extensions. It is particularly difficult to write code that is as portable as possible in such circumstances. But a generous helping of memory barriers can certainly help!

One thing to remember about optimisation levels for compilers is that they are not commands - they are hints. A compiler is free to use any valid optimisation technique regardless of the command line settings. So if your code works with -O2 and not -O3, then perhaps with the next version of your toolchain it will also break with -O2 or -O1.

glenenglish wrote on Wednesday, April 19, 2017:

Hi David
Yeah, I think if something doesn’t work under -O3 but does on -O2, it’s a case of “you got away with it once”. I.e. strictly speaking, and under critical analysis, you were sloppy. But only very strictly. OK on the commands/hints - I was not aware of that behaviour. But I find tracking down -O3 related optimizer side effects to be quite difficult.

davidbrown wrote on Wednesday, April 19, 2017:

Yes, figuring out what went wrong when you changed optimisation levels can be very difficult. And with more advanced optimisation, there can be a bit of luck involved - seemingly irrelevant changes can lead to the compiler picking a different balance for when a function is inlined or code sequencing is re-arranged, and suddenly there is a difference in what works and what does not work. The kind of things that cause such problems are often quite subtle.

Sometimes it is possible to track the issue by manually enabling or disabling different optimisation flags. With gcc, the -O flags mostly control groups of individual optimisation passes, which can be enabled or disabled somewhat independently. (I am being deliberately vague here, because reality is not quite as simple as this suggests.) A bit of trial and error can give you clues as to what might have failed.

Of course, when the error is an accidental race condition that only turns up once in a blue moon - or when the customer is demonstrating the system in front of his boss - you probably just want to disable the higher optimisations for now, and add a “fixme” comment for the future :-)

mhans2 wrote on Monday, April 08, 2019:

I just encountered this issue in our application on ARM Cortex-M7 with GCC 8.2.0. The code is something like this:

void modifyLinkedList(List *list) {
    vTaskSuspendAll();
    list->prev = last;
    list->next = nullptr;
    // modify the list some more
    xTaskResumeAll();
}
Our linked lists would sometimes get corrupted, which eventually caused the watchdog timer to trigger. We tracked the issue down to the function above. Inspection of the assembly revealed that GCC’s link-time optimization had inlined vTaskSuspendAll() and moved the incrementing of uxSchedulerSuspended a few statements down. Now the code was the equivalent of the following:

void modifyLinkedList(List *list) {
    list->prev = last;
    list->next = nullptr;
    ++uxSchedulerSuspended;  // vTaskSuspendAll() inlined, increment moved down
    // modify the list some more
    xTaskResumeAll();
}
This means that the critical section did not start where we expected it to start. Adding a compiler barrier to the end of vTaskSuspendAll() and the beginning of xTaskResumeAll() fixes it.

I’d like to suggest that something like this should be added to FreeRTOS.

mark1122 wrote on Tuesday, April 09, 2019:

Good stuff in this thread. I use Intel’s NiosII in the FPGA a lot. I guess Hans’ code would need two asm volatile("sync") statements, after the suspend and before the resume. I will watch the assembly output closely next time my code breaks.

rtel wrote on Tuesday, April 09, 2019:

vTaskSuspendAll() is just an extremely simple C function that does
nothing but increment a volatile variable. As the code is at the moment
there is nowhere a barrier can be placed in a compiler independent way -
all compiler-specific code is isolated in the port layer. We could
introduce a new macro that evaluates to a barrier for Cortex-M and is
empty if left undefined.

rtel wrote on Tuesday, April 09, 2019:

Can you confirm that adding __asm volatile("" ::: "memory"); after the
variable increment (inside vTaskSuspendAll()) fixes the issue for you
(or not, as the case might be). Thanks.

richarddamon wrote on Tuesday, April 09, 2019:

I suppose the problem is due to the fact that the C Standard only specifies that the order of actions is preserved with volatile (and only volatile to volatile, not volatile to non-volatile). It used to be that compilers weren’t ‘smart’ enough to peek into functions to confirm that they don’t do anything that the Standard says forces an ordering, but now some of them do.

While the later versions of the Standard provide ways of specifying that, FreeRTOS probably can’t assume that you have that new of a compiler, so it probably makes sense for the port layer to define a macro that implements such a barrier (if needed, old compilers where a function call is good enough might not need anything), and then the various forms of critical sections could include use of that macro right after starting the section, and just before ending it.
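A minimal sketch of that port-layer approach (FreeRTOS did later introduce a macro named portMEMORY_BARRIER along these lines; the function bodies below are simplified stand-ins for the real kernel code, and the gcc/clang definition is assumed):

```c
/* Port layer: defaults to empty, for older compilers where an
 * out-of-line function call is already an effective barrier. */
#ifndef portMEMORY_BARRIER
    #define portMEMORY_BARRIER() __asm__ volatile("" ::: "memory")
#endif

static volatile unsigned long uxSchedulerSuspended = 0;

/* Simplified stand-ins for the real kernel functions. */
void vTaskSuspendAll(void)
{
    ++uxSchedulerSuspended;
    portMEMORY_BARRIER(); /* protected accesses cannot be hoisted above the increment */
}

void xTaskResumeAll(void)
{
    portMEMORY_BARRIER(); /* protected accesses cannot sink below this point */
    --uxSchedulerSuspended;
}
```

Placing the barrier just after the suspend and just before the resume keeps the user's accesses pinned inside the section even when link-time optimisation inlines both calls.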

mhans2 wrote on Tuesday, April 09, 2019:

I can confirm that this works.