GCC9 compiled code corrupts function local variables (related to VFP/FPU support)

We have found that GCC >= 9 uses ARM VFP registers to store local function
variables instead of storing them on the stack. It looks like a smart idea, because accessing
a VFP register is faster than accessing the stack, but there is one serious
problem.
The compiler can produce code like this for ANY more complex function with a lot of local variables, including functions which don’t use floating point in their sources:

stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
vpush {d8}
… do something
vmov s16, r3 ; temporarily store a local variable to a VFP register
… do something (*)
vmov r1, s16 ; restore the local variable from a temporary VFP register
… do something
vpop {d8}
ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}

The problem is that only some processes have the flag set to save/restore the FPU/VFP context (the vPortEnableVFP function on ARM_CM4F, or vPortTaskUsesFPU on ARM_CR5 and ARM_CA9).
When FreeRTOS switches to another process at the (*) location and the current process does not have FPU/VFP enabled, the local function variable is destroyed.

One possible workaround is to enable FPU/VFP for ALL processes regardless of whether they use the floating point unit. But that has a significant impact on stack usage and context switching time.
Another workaround is to pass the -mgeneral-regs-only option to the compiler, but it is not compatible with code which regularly uses floating point operations.

See https://gcc.gnu.org/pipermail/gcc-help/2020-July/139112.html


This might be useful:

It’s not really an elegant solution though. I would be interested to know if there is a better solution, as I have had this problem before.

Yep! It is the same problem.

We have already tried to use __attribute__ ((general-regs-only)), but it just makes the compiler emit a lot of messages about an ignored attribute. For some functions it works and for some it does not.
And I was not able to find out why it is ignored.

Besides this, there is a problem with LTO (link time optimization). GCC says nothing about how the assembly is generated at link time with respect to a general-regs-only setting forced by a function attribute.

The second problem is pre-compiled libraries. They can contain the same kind of code, which temporarily stores a general register in the VFP/FPU unit.

This has long been an issue on Cortex-A chips where the application writer has to say which tasks have a floating point context and which don’t, but I don’t understand why it would be an issue on a Cortex-M core as Cortex-M’s manage the floating point context for you. As soon as a task uses a floating point instruction it will automatically be given a floating point context.
That will increase your stack size, so make sure your problem is not just a stack overflow.

vPortEnableVFP() is only called once when the scheduler starts - application writers should not be using that function.

I assume accessing a flop register requires a floating point instruction, so you should then have a floating point context.

Which FreeRTOS port layer are you using? (which directory is it in?)

The problem is not in FreeRTOS. It is a generic problem of any RTOS which tries to separate tasks/processes based on their usage of VFP, in combination with GCC9. It depends on the presence of a VFP/FPU, independently of the core (Cortex-A, R4F, R5F, M4F …)

By default, GCC9 can use VFP/FPU registers anywhere to temporarily store general local variables instead of putting them on the stack. This means that the user is not able to select which processes need VFP enabled.

We are not using FreeRTOS, but an RTOS made in-house (long story why). I asked about this behaviour on the gcc-help mailing list and got only one answer, which expressed some doubts about sorting tasks into floating-point and non-floating-point ones. But I think it is a common technique in embedded RTOSes, because it seriously improves task switching time and memory footprint. Therefore I checked whether FreeRTOS has the same problem, found that it does, and “nobody” knows about it! That was the motivation to create this topic.

EDIT: ARM-R4F is not affected

Kind of. You have to ensure that the FPU is physically enabled. However, assuming that is the case, the FPU context is handled automatically in Cortex-M, as the processor knows whether it has a floating point context that needs saving. You just need to ensure you have enough task stack for it.

On the Cortex-R it is different. The task needs to know if it has a floating point context, as the processor cannot deduce this automatically.

EDIT: I didn’t know about Cortex-M4(F) lazy stacking and context switching. This means that the ARM-M4F port is not impacted by this new GCC9 behavior.

You are right about ARM-CR. Cortex-R looks more like Cortex-A than Cortex-M. The major difference, apart from the security enhancements, is an MPU instead of an MMU (on R4 and R5).

But the ARM-CA and ARM-CR ports are still affected.

It is not new to GCC9 though. Our Zynq (A9) ports have coped with this since they were written. There are two additional methods: the first is to provide your own implementations of the functions that use floating point registers (which is what many of our demos do), and the second is to configure FreeRTOSConfig.h such that each task is given a floating point context when it is created.

Can you explain it in more detail? We don’t have problems with floating point functions. The problem is that GCC9 can use floating point registers anywhere, without any floating point code in the source (just as temporary storage for a general local value instead of the stack).
I am not sure if it is tied to GCC>=9, but we are not able to reproduce the same behavior on GCC<9.

This kind of optimization was indeed introduced some major versions before GCC 9. Maybe not for all processor back ends at the beginning…
I think you should either pay the price and globally enable FPU context switching for all tasks, or try to build your application without HW FPU support and manually enable/use it where you need it, e.g. by tagging math functions with the proper attributes and managing the calling task(s) accordingly, if this is possible.

I think that we should not pay the price and globally enable VFP/FPU. Let me explain why: we have 20000~60000 context switches per second (and need really hard real-time reactions; we must be faster than the physics of the regulated device). The CPU core is a Cortex-R5 @300MHz.
The price to save/restore the VFP unit is 20 cycles each (in the optimistic case when the CPU doesn’t wait for data). That means 800000~2400000 CPU cycles per second only for the VFP context save. It represents 0.27 ~ 0.8 % of the whole CPU power. Too much; we can reduce it by more than 4/5 by sorting processes into VFP / non-VFP and using lazy VFP context save.

Wow - I hope you mean interrupts executing at that speed, and not task switches.

You have three choices:

  1. Give every task an FPU context.
  2. Use compiler switches that prevent the FPU registers being used outside of floating point operations.
  3. Write your own versions of the libraries to prevent library code using FPU registers.

No. I really mean task switching, not interrupts. It is a completely different system from FreeRTOS; the scheduler is probably the biggest difference, as we are using EDF. And the advantages of EDF are the major reason why the regulators run as processes, not interrupts. But we share the same problem with unwanted VFP register usage as FreeRTOS.
Re 2: We are trying to sort the sources into 2 groups. It is possible, but it is a perfect chance to create a problem which can’t be caught by our QA team.