M4F support for 'double' in multiple tasks (GCC, newlib)?

dnadler · July 17, 2020, 9:42pm

I understand the context-switch mechanism for float (using the single-precision hardware and its associated lazy-save mechanism). Is there any mechanism for thread-safety between tasks using double (software library), i.e., is this included in newlib context switching?
Thanks!
Best Regards, Dave

rtel · July 17, 2020, 10:12pm

Grateful if you can provide more context. How does the compiler handle doubles if not using the floating point registers? If I know how doubles are represented and stored I should be able to answer the question.

richard-damon · July 18, 2020, 12:11am

Generally software based floating point doesn’t have that much of an issue unless the library uses static locations for temporaries. Most of the libraries I have seen (at least for ARM) use register pairs to store doubles (and spill to stack for longer storage), so these get saved as part of a normal context save. If a library did use static variables, the library would need to support putting them in ‘Thread Local’ storage. Newlib tends to be good at this.

Note that the low-level library tends to be naturally register based, as the basic operations take one or two parameters in register pairs and normally return a single result (also in a register pair). It would be ‘higher’ level operations that might want more storage, but for ARM the stack is so much easier to use than static memory locations that, unless the library needed some storage between function calls, it won’t need static memory (and such a function would be unsafe even with hardware floating point).
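
To make the distinction concrete, here is a minimal sketch of the two styles. The function names are made up for illustration and are not taken from newlib or libgcc; the point is only where the intermediate value lives.

```c
/* Non-reentrant style: the intermediate lives in one shared static
 * location. If a context switch happens between the two statements,
 * another task calling the same function can overwrite scratch. */
static double scratch;

double mul_add_static(double a, double b, double c)
{
    scratch = a * b;
    return scratch + c;
}

/* Reentrant style: the intermediate is a local, so it lives in
 * registers or on the calling task's own stack, both of which are
 * part of the normal task context. */
double mul_add_local(double a, double b, double c)
{
    double tmp = a * b;
    return tmp + c;
}
```

Both give the same answer when only one task calls them; only the second stays correct when several tasks call it and can preempt each other.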

dnadler · July 18, 2020, 12:12am

@Richard-Barry If I knew I wouldn’t be asking

Background: Other platforms/tool-chains emulate the FP hardware. For example an 8086 tool-chain I used generated FP instructions and emulated the FP coprocessor. Consequently thread-safety for FP required saving and restoring the entire emulated coprocessor context. I wrote an OS for such a beast, and it was too expensive to do the save-restore but I could at least swap the trap vectors to fault if other than the desired thread used float.

I don’t know what GCC for M4F does about doubles (float, single-precision FP uses the hardware processor, and context switch does save/restore and even permits FP in ISRs). Doubles might be handled entirely on the stack, in which case there’s nothing to be done to provide thread-safety for doubles.

Did I adequately explain my confusion?
Thanks Richard!
Best Regards, Dave

dnadler · July 18, 2020, 12:17am

@richard-damon - Looks like we were typing at the same time. Do you have any idea how GCC8 for M4F supports double?
Thanks!
Best Regards, Dave

richard-damon · July 18, 2020, 12:19am

Yes, the x86 is register poor, and thus a floating point emulator library likely wants to implement a floating point stack in memory (the x86 also addresses absolute memory locations more easily than the ARM). Such an implementation needs that stack to either be saved on a context switch or, better, be made part of the thread local storage. The ARM was designed with many more registers, so it tends to use them for the floating point. ARM toolchains also don’t try to do it via an ‘emulation’ layer, where the code acts like it has a co-processor but the co-processor is actually just emulated. Instead you need to compile for the targeted processor, so the compiler knows whether you have the floating point hardware or not. This does mean that if you move the code to a machine with floating point hardware and run the code, it won’t use the hardware unless you recompile for that processor.

richard-damon · July 18, 2020, 12:20am

I can’t say I know for certain, but I do believe it is register based. A good check would be to step through a bit of code that does double arithmetic and see what code was generated.

rtel · July 18, 2020, 2:41am

I would be very surprised if the compiler did something you had to worry about when using double precision. Only once, in the 40+ architectures ported to, have we ever seen global statics being used to hold temporary values during mathematical calculations, and this was on a tiny 8-bit processor that never expected to be using multithreading. If you can post the assembly code generated for a simple double floating point calculation then we could determine for sure.
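
Something as small as the following would probably do as a probe. The file name, function names, and compiler flags are assumptions for illustration; use whatever options your project actually builds with.

```c
/* probe.c (hypothetical file name)
 *
 * One way to inspect the generated code, assuming the GNU Arm
 * Embedded toolchain and a Cortex-M4F build:
 *
 *   arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 \
 *       -mfloat-abi=hard -O0 -c probe.c
 *   arm-none-eabi-objdump -d probe.o
 */

/* With a single-precision FPU one would expect a hardware
 * instruction such as vadd.f32 here. */
float add_float(float a, float b)
{
    return a + b;
}

/* With no double-precision hardware one would expect a call to a
 * run-time helper such as __aeabi_dadd here. */
double add_double(double a, double b)
{
    return a + b;
}
```

The interesting part of the disassembly is whether the double version touches anything other than the stack and the general purpose registers.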

hs2 · July 18, 2020, 9:28am

As Richard said, the model used e.g. by GCC does not emulate dedicated co-processor HW but does the (double) math in SW. Apart from the runtime hit, this might require a bit more stack.
I also think that the soft-algorithms are non-recursive or bounded at least to avoid surprises regarding stack usage and I’m convinced that the math routines are not stateful. I can’t see any need for that, too.
In short using software (double) FP these days is a no-brainer regarding OS / context switching even if it’s mixed with (partial) HW supported FP math.

aggarg · July 19, 2020, 10:09pm

All the results below are on the STM32L475 with ARM GCC compiler. If I use float, the following code is generated:

00000000 <StartDefaultTask>:
   0:   b580            push    {r7, lr}
   2:   b086            sub     sp, #24
   4:   af00            add     r7, sp, #0
   6:   6078            str     r0, [r7, #4]
   8:   f04f 537f       mov.w   r3, #1069547520 ; 0x3fc00000
   c:   617b            str     r3, [r7, #20]
   e:   4b0f            ldr     r3, [pc, #60]   ; (4c <StartDefaultTask+0x4c>)
  10:   613b            str     r3, [r7, #16]
  12:   ed97 7a05       vldr    s14, [r7, #20]
  16:   edd7 7a04       vldr    s15, [r7, #16]
  1a:   ee77 7a27       vadd.f32        s15, s14, s15
  1e:   edc7 7a03       vstr    s15, [r7, #12]
  22:   ed97 7a03       vldr    s14, [r7, #12]
  26:   edd7 7a05       vldr    s15, [r7, #20]
  2a:   ee77 7a67       vsub.f32        s15, s14, s15
  2e:   eeb7 7a08       vmov.f32        s14, #120       ; 0x3fc00000  1.5
  32:   eef4 7ac7       vcmpe.f32       s15, s14
  36:   eef1 fa10       vmrs    APSR_nzcv, fpscr
  3a:   d503            bpl.n   44 <StartDefaultTask+0x44>
  3c:   2001            movs    r0, #1
  3e:   f7ff fffe       bl      0 <osDelay>
  42:   e7e6            b.n     12 <StartDefaultTask+0x12>
  44:   2002            movs    r0, #2
  46:   f7ff fffe       bl      0 <osDelay>
  4a:   e7e2            b.n     12 <StartDefaultTask+0x12>
  4c:   40200000        .word   0x40200000

The above code uses the stack, general purpose registers and floating point registers. Both sets of registers are saved/restored on a context switch, so there should be no problem.

If I use double, the following code is generated:

00000000 <StartDefaultTask>:
   0:   b590            push    {r4, r7, lr}
   2:   b089            sub     sp, #36 ; 0x24
   4:   af00            add     r7, sp, #0
   6:   6078            str     r0, [r7, #4]
   8:   f04f 0300       mov.w   r3, #0
   c:   4c15            ldr     r4, [pc, #84]   ; (64 <StartDefaultTask+0x64>)
   e:   e9c7 3406       strd    r3, r4, [r7, #24]
  12:   f04f 0300       mov.w   r3, #0
  16:   4c14            ldr     r4, [pc, #80]   ; (68 <StartDefaultTask+0x68>)
  18:   e9c7 3404       strd    r3, r4, [r7, #16]
  1c:   e9d7 2304       ldrd    r2, r3, [r7, #16]
  20:   e9d7 0106       ldrd    r0, r1, [r7, #24]
  24:   f7ff fffe       bl      0 <__aeabi_dadd>
  28:   4603            mov     r3, r0
  2a:   460c            mov     r4, r1
  2c:   e9c7 3402       strd    r3, r4, [r7, #8]
  30:   e9d7 2306       ldrd    r2, r3, [r7, #24]
  34:   e9d7 0102       ldrd    r0, r1, [r7, #8]
  38:   f7ff fffe       bl      0 <__aeabi_dsub>
  3c:   4603            mov     r3, r0
  3e:   460c            mov     r4, r1
  40:   4618            mov     r0, r3
  42:   4621            mov     r1, r4
  44:   f04f 0200       mov.w   r2, #0
  48:   4b06            ldr     r3, [pc, #24]   ; (1c <__aeabi_dcmplt+0x1c>)
  4a:   f7ff fffe       bl      0 <__aeabi_dcmplt>
  4e:   4603            mov     r3, r0
  50:   2b00            cmp     r3, #0
  52:   d003            beq.n   5c <StartDefaultTask+0x5c>
  54:   2001            movs    r0, #1
  56:   f7ff fffe       bl      0 <osDelay>
  5a:   e7df            b.n     1c <StartDefaultTask+0x1c>
  5c:   2002            movs    r0, #2
  5e:   f7ff fffe       bl      0 <osDelay>
  62:   e7db            b.n     1c <StartDefaultTask+0x1c>
  64:   3ff80000        .word   0x3ff80000
  68:   40040000        .word   0x40040000

The above code uses stack, general purpose registers and functions like __aeabi_dadd and __aeabi_dsub. I traced the definition of these functions as well and they also use only stack and general purpose registers. So the context switch code will work in this case as well.
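
For anyone reading the listing, the two literal-pool words at 0x64 and 0x68 are the high halves of the double constants, and each double travels as a pair of 32-bit words (low word in the lower-numbered register). A small host-side sketch to check this, assuming IEEE 754 binary64 doubles; the helper name and struct are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the two 32-bit halves of a double, in the
 * order a little-endian Cortex-M4 stores them (low word first). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} double_words_t;

/* Split a double into its low and high 32-bit words.
 * Assumes double is IEEE 754 binary64 (8 bytes). */
double_words_t split_double(double d)
{
    uint64_t bits;
    double_words_t w;

    memcpy(&bits, &d, sizeof bits);          /* well-defined type pun */
    w.lo = (uint32_t)(bits & 0xFFFFFFFFu);   /* goes in r0 or r2 */
    w.hi = (uint32_t)(bits >> 32);           /* goes in r1 or r3 */
    return w;
}
```

With this, 1.5 splits into low word 0 and high word 0x3ff80000, and 2.5 into low word 0 and high word 0x40040000, which matches the mov.w #0 / ldr pairs in the listing.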

As Richard mentioned, if you can share the assembly generated for your platform, we can be doubly sure.

Thanks.

dnadler · July 19, 2020, 10:27pm

Right, but the problem is that the language definition (well, in some C/C++ corners and variants) assumes a stateful floating-point processor. This is to control things like rounding modes, and also to manage floating-point exceptions. See <fenv.h> (<cfenv> in C++). Now, for wee Arms, it looks like GCC punts for at least the soft float implementations; punting is apparently permitted by the standard. I have no idea what happens when you’ve got hardware float and software double; I don’t know if the standard addresses this. Here’s some recent discussion on the newlib mailing list about fenv support. Perhaps it’s best to avoid dark corners.
Thanks for the explanations!
Best Regards, Dave

hs2 · July 20, 2020, 9:26am

I think it’s rather a library/POSIX thing than the core language.
And I fully agree, just try to avoid the dark corners.
Usually you can mix float and double math with just single precision FP HW support w/o any problem.

richard-damon · July 20, 2020, 10:44am

The language also allows the implementation to ‘punt’ on most of those features, as long as it documents that (and the language only somewhat recently even acknowledged ‘multi-threaded’ code). Yes, to support those features the memory that stores those options needs to be ‘Thread Local’ (which FreeRTOS supports) or, for hardware, you would need to save that hardware state as part of the task context (increasing the overhead of using Hardware Floating Point in a task).
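
As a rough sketch of what ‘Thread Local’ state for a software floating-point environment could look like: every name below is hypothetical and none of it comes from newlib or libgcc. The sketch uses C11 _Thread_local so it builds on a host; on FreeRTOS the equivalent would more likely be a per-task structure reached through vTaskSetThreadLocalStoragePointer() and pvTaskGetThreadLocalStoragePointer().

```c
/* Hypothetical rounding modes for a software floating-point library. */
typedef enum {
    SOFT_ROUND_NEAREST,
    SOFT_ROUND_TOWARD_ZERO
} soft_round_t;

/* Hypothetical per-thread floating-point environment. */
typedef struct {
    soft_round_t rounding;  /* current rounding mode */
    unsigned     flags;     /* sticky exception flags */
} soft_fenv_t;

/* One instance per thread, so one task changing its rounding mode
 * cannot affect the arithmetic of another task. */
static _Thread_local soft_fenv_t soft_fenv = { SOFT_ROUND_NEAREST, 0u };

void soft_set_round(soft_round_t mode)
{
    soft_fenv.rounding = mode;
}

soft_round_t soft_get_round(void)
{
    return soft_fenv.rounding;
}
```

If the same structure were a plain static instead, it would be exactly the kind of shared state that needs saving on every context switch.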