Atomicity and `volatile`

system · March 17, 2016, 9:11pm

mrkline wrote on Thursday, March 17, 2016:

FreeRTOS seems to take an… interesting approach to atomicity. Code in tasks.c and list.h uses volatile in an apparent attempt to make some code safe from compiler optimizations that would “break” said code during inopportune context switches (either to another task or to an interrupt “beneath” the OS). The smoking gun is the existence of configLIST_VOLATILE and its comment:

The list structure members are modified from within interrupts, and therefore by rights should be declared volatile. However, they are only modified in a functionally atomic way (within critical sections of with [sic] the scheduler suspended) and are either passed by reference into a function or indexed via a volatile variable. Therefore, in all use cases tested so far, the volatile qualifier can be omitted in order to provide a moderate performance improvement without adversely affecting functional behaviour. The assembly instructions generated by the IAR, ARM and GCC compilers when the respective compiler’s options were set for maximum optimisation has been inspected and deemed to be as intended. That said, as compiler technology advances[…] it is feasible that the volatile qualifier will be needed for correct optimisation[…] If this is ever experienced then the volatile qualifier can be inserted in the relevant places within the list structures by simply defining configLIST_VOLATILE to volatile in FreeRTOSConfig.h (as per the example at the bottom of this comment block).

This, along with a linter note in tasks.c that “A manual analysis and inspection has been used to determine which static variables must be declared volatile”, gives the impression that FreeRTOS authors believe declaring a variable volatile makes it safe to use in multiple, concurrent contexts. This misconception is as dangerous as it is old. volatile only guarantees that the compiler will not optimize away reads and writes to a given value, nor perform speculative reads ahead of time. It is only useful for accessing “special” addresses, such as MMIO. For lockless interaction between different contexts of execution, a set of ordering guarantees is also required, which volatile simply does not provide.

This is so crucial to systems programming that the 2011 ISO standards for C and C++ spend great effort to define atomic semantics and the orderings needed (see http://en.cppreference.com/w/c/atomic). Unless I am mistaken, the only reason FreeRTOS doesn’t fall apart during untimely context switches is that most platforms which run it are single-core microprocessors where a write to an address is not buffered or cached in some way. In these cases, volatile has been “good enough” to stop breaking optimizations. However, this is a very poor safety net. Some architectures supported by FreeRTOS have instructions needed to properly order lockless code (see ARM’s DSB). Barrier instructions of this kind are emitted as needed by C11 atomic reads and writes, but not by volatile. And as the comments for configLIST_VOLATILE note, optimizers could quickly ruin the assumptions FreeRTOS makes about these accesses.
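To make the ordering point concrete, here is a minimal sketch of publishing a value with C11 release/acquire atomics. The names `payload`, `payload_ready`, `publish` and `try_consume` are hypothetical and not taken from FreeRTOS; the sketch assumes a compiler that ships `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared state: plain data plus an atomic flag that publishes it. */
static uint32_t payload;           /* ordinary variable, deliberately not volatile */
static atomic_bool payload_ready;  /* static storage: zero-initialised to false */

/* Writer context (for example an ISR or another task). */
void publish(uint32_t value)
{
    payload = value;
    /* Release store: the write to payload cannot be reordered after it. */
    atomic_store_explicit(&payload_ready, true, memory_order_release);
}

/* Reader context. Returns true and fills *out once a value has been published. */
bool try_consume(uint32_t *out)
{
    /* Acquire load: if it observes true, the payload write is visible as well. */
    if (!atomic_load_explicit(&payload_ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}
```

With a volatile flag in place of the atomic one, the compiler would still perform each access to the flag, but nothing would order the plain write to `payload` against it.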

The obvious solution would be to use the C11 facilities for variables shared locklessly across several contexts of execution, but given that FreeRTOS supports many compilers (several of which can be charitably described as “stable” and uncharitably described as “archaic”), a different approach may be needed. One pre-C11 approach is to create port-specific inline functions or macros that perform various atomic operations (load, store, increment, decrement, compare and swap, etc.). This is the approach taken by Linux. At any rate, why is volatile being used in these places?
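As a rough sketch of the port-specific approach, a GCC or Clang port could wrap the compiler's `__atomic` builtins behind a small set of inline functions. The `port_atomic_*` names below are hypothetical and are not part of FreeRTOS; a port for an older compiler would have to supply its own bodies (inline assembly, or a short critical section).

```c
#include <stdint.h>

/* Hypothetical port-layer API; these names are illustrative only. */
#if defined(__GNUC__)

static inline uint32_t port_atomic_load_u32(uint32_t *addr)
{
    return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
}

static inline void port_atomic_store_u32(uint32_t *addr, uint32_t value)
{
    __atomic_store_n(addr, value, __ATOMIC_RELEASE);
}

/* Adds n and returns the value held before the addition. */
static inline uint32_t port_atomic_fetch_add_u32(uint32_t *addr, uint32_t n)
{
    return __atomic_fetch_add(addr, n, __ATOMIC_ACQ_REL);
}

/* Returns 1 if *addr equalled expected and was replaced by desired, else 0. */
static inline int port_atomic_cas_u32(uint32_t *addr, uint32_t expected,
                                      uint32_t desired)
{
    return __atomic_compare_exchange_n(addr, &expected, desired,
                                       0 /* strong */,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

#else
#error "No atomic primitives for this compiler; a port would define them here."
#endif
```

Core code would then call only the `port_atomic_*` functions, so each port decides whether an operation maps to a single instruction, a fenced sequence, or a critical section.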

system · March 17, 2016, 9:33pm

edwards3 wrote on Thursday, March 17, 2016:

At any rate, why is volatile being used in these places?

?

To ensure variables that are accessed and updated from more than one thread of execution are updated and written back to memory atomically as seen from other threads and so are always in a consistent state before another thread of execution accesses the same variable. The atomic behavior is achieved by the fact of having the updates performed in one form of critical section or another which is portable for every compiler and architecture.

system · March 18, 2016, 12:01am

mrkline wrote on Friday, March 18, 2016:

The atomic behavior is achieved by the fact of having the updates performed in one form of critical section or another which is portable for every compiler and architecture.

Okay, and looking at the ports for architectures with multiple cores or out-of-order execution, I can see that critical sections are guarded with memory fences (DSB on ARM, to continue using that example). But if the memory barriers ensure that writes are “seen” by other threads of execution once leaving the critical section, what is the use of volatile? For reads on said values?
If so:

Reading a volatile value doesn’t perform an acquire operation to ensure writes happened before the load from the perspective of the current thread (see http://en.cppreference.com/w/c/atomic/memory_order#Release-Acquire_ordering).
volatile does not necessarily guarantee a “correct” ordering when reading a value outside a critical section. (One reading of the standard is that ordinary loads can move across a volatile load/store in either direction, but ordinary stores can’t; see sources below.)
volatile can cause pessimizations unneeded by correct lockless code.

Additionally, memory fences are quite pessimal on architectures that have loads and stores with built-in acquire and release semantics, such as ARMv8’s LDAR and STLR. (FreeRTOS supports several such chips.)

In short, volatile is the wrong tool for atomicity and concurrent memory access.

See:

Herb Sutter’s atomic<> Weapons talk (Part 1, Part 2). While the talk is on C++, C has the same semantics (the ISO C people lifted it from the C++11 standard, more or less).
Atomic operations library - cppreference.com

rtel · March 18, 2016, 9:56am

rtel wrote on Friday, March 18, 2016:

This has been debated a lot, but not for a long time. The discussions are always interesting, and thanks for the links, I will read through them. We have also taken opinions from compiler vendors.

The first thing to do is limit the discussion to the FreeRTOS use cases, which are very different to Linux:

1. Single core. FreeRTOS is used more and more in multi-core, but predominantly in AMP mode, where any shared memory is treated specifically (cache flushes, etc.). The SMP version of FreeRTOS was created by a third party, so not discussed here.
2. Historically, and still predominantly, on cores that do not perform out of order operations. This is changing quickly though and, as you point out, memory barriers are used where necessary in the portable layer rather than in the core code.
3. Supporting some 20+ ‘compilers’, where in many cases a 90’s version of C is a big ask, let alone C11; some aren’t really C compilers at all. Add into the mix that on at least one occasion a premium compiler has required a volatile as a fix to an optimisation bug even when there was no conceivable ‘normal’ reason for volatile to be used.
4. Supporting some very basic processor architectures.
5. Kernel data structures can change at any time, including directly from inside interrupts.
6. The functions are generally very short.

The sequence:

    Enter_critical_section();
    Update_volatile_variable();
    Exit_critical_section();

is taken as a ‘best compromise’ approach that meets all the above constraints in a very portable way. The variable is volatile to ensure it gets written back to memory given the constraints above (especially point 5), and barriers can be added to the Exit_critical_section() function on architectures that need it without upsetting ports that have no equivalent.

Now the disadvantage of the above is that all accesses to the variable within the critical section should be written back to memory, which is sub-optimal, as it is only absolutely necessary when the critical section is exited. In most cases this is not deemed an issue, as the variable may only be accessed once, where reading it from memory, modifying it and writing it back to memory is exactly what is wanted, and generally critical sections are very small. Some optimisations have been included where this is not the case. For example, where the tick count is read more than once, a local const version is created (it cannot be updated in the critical section, or when the scheduler is locked) so the compiler only takes a single bite at it.
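The local const copy described above might look something like the following sketch. `tick_count`, `timeout_t` and `timeout_start` are hypothetical stand-ins rather than the kernel's actual names, and the caller is assumed to be inside a critical section (or to have the scheduler suspended) so the count cannot change underneath it.

```c
#include <stdint.h>

/* Stand-in for a kernel variable that the tick interrupt updates. */
static volatile uint32_t tick_count;

/* Hypothetical timeout record, for illustration only. */
typedef struct {
    uint32_t entered_at;  /* tick at which the wait began */
    uint32_t deadline;    /* tick at which the wait expires (may wrap) */
} timeout_t;

/* Assumed to be called with the tick count unable to change. */
void timeout_start(timeout_t *t, uint32_t ticks_to_wait)
{
    /* One volatile read; both uses below work on the local copy. */
    const uint32_t now = tick_count;

    t->entered_at = now;
    t->deadline = now + ticks_to_wait;  /* unsigned arithmetic wraps safely */
}
```

Without the local copy, each mention of `tick_count` would force a separate load from memory, even though the value cannot change at that point.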

On architectures that support special atomic instructions, or using compilers that support the latest and greatest standard (?), some of the volatiles could be removed safely, and perhaps it is time to refactor accordingly, but to do so would require the introduction of a port specific volatile keyword so volatile and port specific volatiles could be mixed and matched within the same code. However, performance gain would be minimal given the short function lengths (the data still has to be read from memory, modified, then written back to memory, then fenced if required).

This page has been referenced a few times…

https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt

…which gives the example:

    spin_lock(&the_lock);
    do_something_on(&shared_data);
    do_something_else_with(&shared_data);
    spin_unlock(&the_lock);

…but this is not something that is going to happen in a system where only one thing can be executing at a time. Spinning in that situation is not going to work if the thing you are waiting for to unlock a resource cannot execute because you are spinning. In FreeRTOS, and other similar systems, if a resource is locked then the currently executing thread must tell the scheduler it cannot proceed until the lock is released or a timeout occurs, at which time the scheduler will stop executing it.

Thoughts?

hs2 · March 18, 2016, 10:59am

hs2sf wrote on Friday, March 18, 2016:

OS things will surely get much more complicated with more complex processors, e.g. multi-core ones. Also it’s pretty hard to fully support the variety of processor architectures and associated compilers. As far as I can see this doesn’t really apply to the currently, officially supported systems.
Nevertheless it’s desirable to make use of enhancements some compilers do support today for some architectures. Hence I’m in contact with FreeRTOS concerning C(++)11 (std::)atomic support for 32 bit ARM / GCC4.9.3 environments. As mentioned by Matt this would allow implementing lockless data structures/algorithms for better performance/responsiveness than using higher order locking mechanisms.
I hope the fairly simple patch gets integrated into V9. Well, realistically maybe in V9.1

system · March 21, 2016, 9:06pm

mrkline wrote on Monday, March 21, 2016:

This page has been referenced a few times…
https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
…which gives the example:

    spin_lock(&the_lock);
    do_something_on(&shared_data);
    do_something_else_with(&shared_data);
    spin_unlock(&the_lock);

…but this is not something that is going to happen in a system where only one thing can be executing at a time.

The important point here is not that the code is surrounded by spinlocks (which as you correctly point out, are useless* in a single-core machine with interrupts disabled), but that the spinlocks act as memory barriers. If you follow spin_lock through a few layers of macros, you find it calls barrier, which tells the compiler not to make assumptions about the current values in memory. In GCC, this is done with asm volatile("" ::: "memory"); (see https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers). By doing this instead of making all variables in the critical section volatile, the compiler is free to perform its normal optimizations within the critical section, but not make assumptions about a memory location’s contents before or after. I haven’t read much of Linux’s internals myself (nor do I have an ARMv7 build of Linux sitting around to disassemble), but I assume spin_lock also provides hardware memory fences (i.e. special instructions, such as ARM’s DSB) on systems that need it.
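A minimal sketch of that idea, assuming GCC or Clang extended asm: the critical-section entry and exit carry a compiler barrier, and the shared variable itself is left non-volatile. `enter_critical`, `exit_critical`, `critical_nesting` and `shared_counter` are hypothetical stand-ins; a real port would also mask interrupts and add a hardware fence where the architecture needs one.

```c
#include <stdint.h>

/* Compiler-only barrier: emits no instruction, but the compiler must assume
 * that any memory may have been read or written at this point. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-ins for a port's critical-section primitives. */
static unsigned critical_nesting;

static inline void enter_critical(void)
{
    /* A real port would disable interrupts here. */
    critical_nesting++;
    compiler_barrier();
}

static inline void exit_critical(void)
{
    /* Everything written inside the section is stored before this point. */
    compiler_barrier();
    critical_nesting--;
    /* A real port would re-enable interrupts when nesting reaches zero. */
}

static uint32_t shared_counter;  /* deliberately not volatile */

uint32_t add_twice(uint32_t n)
{
    uint32_t result;

    enter_critical();
    shared_counter += n;  /* the compiler is free to merge these updates */
    shared_counter += n;
    result = shared_counter;
    exit_critical();

    return result;
}
```

Inside the section the compiler can keep `shared_counter` in a register, while the barriers stop it from carrying a cached value across the boundary.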

Are variables modified inside critical sections accessed outside of one? If so, hardware memory fences may be needed on out-of-order or multiprocessor machines to ensure that all writes from thread A are “seen” by thread B when thread B performs its reads. This is where I am more concerned - using volatile variables inside the critical section is pessimal, but provides the needed ordering guarantees (by virtue of being in the critical section). Any reads outside the critical section, however, are not provided the needed ordering guarantees by virtue of being volatile.

Also, it seems the C++ article for the C11/C++11 memory model explains things a bit more; hopefully you find it helpful: http://en.cppreference.com/w/cpp/atomic/memory_order

* It’s also worth noting that spin_lock seems to expand to a critical section on single-processor machines: http://lxr.free-electrons.com/source/include/linux/spinlock_api_up.h#L58