Zynq Ultrascale MPSoC task floating point corruption

Hello,
I’m running FreeRTOS 9.0.1 on a Xilinx ZCU102 eval board (with the ARM_CA53_64_BIT port). I am using GCC. My application ISR uses floating point, and I #defined configUSE_TASK_FPU_SUPPORT to 2. From within all tasks that use FP I also call vPortTaskUsesFPU(). However, when my interrupt handler runs (typically 100 to 1000 times/sec), one of my FP-enabled tasks shows a calculation error that is only explainable if the task FP context is not being saved/restored properly (i.e., when I disable the interrupt in that task before the calculation and re-enable it afterwards, the problem is not seen. The problem is also not seen if the interrupt never executes). Can you suggest what I need to do so that my task FP calcs are not affected by my ISR that uses FP?

Thank you,
Keith

From memory I don’t think the floating point context is saved and restored on interrupt entry, just on task switching. That would mean you can’t perform floating point operations inside an interrupt unless you save and restore the context manually yourself. That behavior could be changed, but only at the cost of much larger stacks and slower interrupt entry and exit. To avoid that, could you defer the processing that actually uses the floating point registers to a task, rather than performing the operation directly in the interrupt service routine? That can be done using something like xTimerPendFunctionCallFromISR() or simply by having your own task get unblocked by the ISR to perform the operation.
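A minimal sketch of the second option, where the ISR does integer work only and unblocks a task that performs the floating point calculation. It assumes the standard task notification calls (vTaskNotifyGiveFromISR() and ulTaskNotifyTake()) and the port's vPortTaskUsesFPU(). The names vMyDeviceISR, prvFpWorkerTask, vStartFpWorker, ulReadDeviceRegister and the scaling constant are hypothetical, and the fragment needs the kernel headers, so it only builds as part of a FreeRTOS project:

```c
#include "FreeRTOS.h"
#include "task.h"

extern uint32_t ulReadDeviceRegister( void );   /* hypothetical device access */

static TaskHandle_t xFpWorker = NULL;           /* hypothetical worker handle */
static volatile uint32_t ulLatestRaw = 0;       /* raw sample captured by the ISR */
static volatile float fLatestScaled = 0.0f;     /* result of the deferred FP work */

/* ISR side: integer work only, then wake the worker task. */
void vMyDeviceISR( void )
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;

    ulLatestRaw = ulReadDeviceRegister();
    if( xFpWorker != NULL )
    {
        vTaskNotifyGiveFromISR( xFpWorker, &xHigherPriorityTaskWoken );
    }
    /* Request a context switch on exit if the worker should run next. */
    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}

/* Task side: all floating point work happens here, in task context. */
static void prvFpWorkerTask( void *pvParameters )
{
    ( void ) pvParameters;
    vPortTaskUsesFPU();   /* this task needs a floating point context */

    for( ;; )
    {
        ulTaskNotifyTake( pdTRUE, portMAX_DELAY );
        fLatestScaled = ( float ) ulLatestRaw * 0.001f;  /* placeholder maths */
    }
}

void vStartFpWorker( void )
{
    xTaskCreate( prvFpWorkerTask, "fpwork", 1024, NULL,
                 configMAX_PRIORITIES - 1, &xFpWorker );
}
```

Giving the worker a high priority means it runs as soon as the interrupt exits, but the result is still one context switch later than doing the work in the ISR itself.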

Unfortunately our real-time response requirements prevent us from deferring the work to a task. Is there some example code you can point me to that will save/restore the FP regs?

Thanks,

Hmm, it is a bit involved.

This is where the floating point context is saved if you switch a task context: https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/V10.3.1-kernel-only/portable/GCC/ARM_CA9/portASM.S#L76 and there is also the corresponding code to pop the context in the restore macro: https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/V10.3.1-kernel-only/portable/GCC/ARM_CA9/portASM.S#L109

Here you can see how the context is saved and restored when a task yields: https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/V10.3.1-kernel-only/portable/GCC/ARM_CA9/portASM.S#L146

… but in an interrupt handler only the minimum is saved on entry: https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/V10.3.1-kernel-only/portable/GCC/ARM_CA9/portASM.S#L162

…and only if the potentially nested interrupts determine a context switch should be performed is it actually performed (with the floating point registers included) when interrupt nesting has unwound and the interrupt is about to be exited:

So you would have to repeat the floating point save and restore, as done in the save and restore macros (first couple of links), on every interrupt nesting entry and exit - rather than only when the interrupt determines it should perform a context switch.
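The difference between the two schemes can be shown with a small host-side model. This is only a sketch of the behavior described above, not port code: every name in it (nesting_depth, yield_required, fp_saves, irq_enter and so on) is hypothetical, and "saving" the FP registers is reduced to incrementing a counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model only: none of these are FreeRTOS port symbols. */
uint32_t nesting_depth;   /* number of interrupts currently nested        */
bool     yield_required;  /* an ISR asked for a context switch on exit    */
uint32_t fp_saves;        /* how many times the FP registers were "saved" */

void model_reset(void)
{
    nesting_depth = 0;
    yield_required = false;
    fp_saves = 0;
}

/* save_fp_on_entry == false models the stock handler: only the minimum
   integer context is saved on entry. true models the suggested change. */
void irq_enter(bool save_fp_on_entry)
{
    nesting_depth++;
    if (save_fp_on_entry) {
        fp_saves++;
    }
}

void request_yield(void)
{
    yield_required = true;
}

void irq_exit(void)
{
    nesting_depth--;
    /* A full context switch, which includes the FP registers, only happens
       once nesting has unwound and a switch was actually requested. */
    if (nesting_depth == 0 && yield_required) {
        yield_required = false;
        fp_saves++;
    }
}
```

In the stock scheme an interrupt that does not request a switch never saves the FP registers, which is exactly the case where an ISR using floating point corrupts the interrupted task.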

Thanks Richard. A follow-on question for the Ultrascale port: FreeRTOSConfig.h defines macro configUSE_TASK_FPU_SUPPORT, but I don’t see any code that uses this macro. So it appears that all FP-enabled tasks must call vPortTaskUsesFPU(). Do you agree?

Looking at GCC\ARM_CA9\port.c:

#if( configUSE_TASK_FPU_SUPPORT == 1 )
{
	/* The task will start without a floating point context.  A task that
	uses the floating point hardware must call vPortTaskUsesFPU() before
	executing any floating point instructions. */
}
#elif( configUSE_TASK_FPU_SUPPORT == 2 )
{
	/* The task will start with a floating point context.  Leave enough
	space for the registers - and ensure they are initialised to 0. */
}
#endif

Does that answer your question?

No, since I am using the Ultrascale port (GCC/ARM_CA53_64_BIT). Can you comment on that case?

Looks like that works the same way. If you look at the IRQ handler, it only saves the volatile integer registers on entry - you would also have to save the volatile floating point registers. As in the A9 port, a full context switch is only performed at the end of the interrupt if one is actually required.

We could change this behavior - but that is how it has always been. It is done for efficiency.
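For anyone wanting to try this, here is a rough sketch of what saving and restoring the AArch64 floating point registers around a handler could look like with GCC. It is an assumption-laden illustration, not the port's code: fpu_ctx_t, fpu_ctx_save, fpu_ctx_restore, run_handler_with_fpu_saved, example_fp_handler and g_filtered are all hypothetical names. It saves all of Q0-Q31 plus FPSR and FPCR because the AArch64 procedure call standard only requires a callee to preserve the low 64 bits of V8-V15, so everything else may be overwritten by compiled code. On anything other than an AArch64 ELF target the two routines compile to empty stubs.

```c
#include <stdint.h>

/* Hypothetical save area: Q0-Q31 (32 x 16 bytes), then FPSR and FPCR. */
typedef struct {
    uint64_t q[32][2];
    uint64_t fpsr;
    uint64_t fpcr;
} __attribute__((aligned(16))) fpu_ctx_t;

void fpu_ctx_save(fpu_ctx_t *ctx);           /* pointer arrives in x0 */
void fpu_ctx_restore(const fpu_ctx_t *ctx);  /* pointer arrives in x0 */

#if defined(__aarch64__) && defined(__ELF__)
__asm__(
    ".text\n"
    ".global fpu_ctx_save\n"
    ".type fpu_ctx_save, %function\n"
    "fpu_ctx_save:\n"
    "  stp q0,  q1,  [x0, #0]\n"
    "  stp q2,  q3,  [x0, #32]\n"
    "  stp q4,  q5,  [x0, #64]\n"
    "  stp q6,  q7,  [x0, #96]\n"
    "  stp q8,  q9,  [x0, #128]\n"
    "  stp q10, q11, [x0, #160]\n"
    "  stp q12, q13, [x0, #192]\n"
    "  stp q14, q15, [x0, #224]\n"
    "  stp q16, q17, [x0, #256]\n"
    "  stp q18, q19, [x0, #288]\n"
    "  stp q20, q21, [x0, #320]\n"
    "  stp q22, q23, [x0, #352]\n"
    "  stp q24, q25, [x0, #384]\n"
    "  stp q26, q27, [x0, #416]\n"
    "  stp q28, q29, [x0, #448]\n"
    "  stp q30, q31, [x0, #480]\n"
    "  mrs x1, fpsr\n"
    "  mrs x2, fpcr\n"
    "  str x1, [x0, #512]\n"
    "  str x2, [x0, #520]\n"
    "  ret\n"
    ".global fpu_ctx_restore\n"
    ".type fpu_ctx_restore, %function\n"
    "fpu_ctx_restore:\n"
    "  ldr x1, [x0, #512]\n"
    "  ldr x2, [x0, #520]\n"
    "  msr fpsr, x1\n"
    "  msr fpcr, x2\n"
    "  ldp q0,  q1,  [x0, #0]\n"
    "  ldp q2,  q3,  [x0, #32]\n"
    "  ldp q4,  q5,  [x0, #64]\n"
    "  ldp q6,  q7,  [x0, #96]\n"
    "  ldp q8,  q9,  [x0, #128]\n"
    "  ldp q10, q11, [x0, #160]\n"
    "  ldp q12, q13, [x0, #192]\n"
    "  ldp q14, q15, [x0, #224]\n"
    "  ldp q16, q17, [x0, #256]\n"
    "  ldp q18, q19, [x0, #288]\n"
    "  ldp q20, q21, [x0, #320]\n"
    "  ldp q22, q23, [x0, #352]\n"
    "  ldp q24, q25, [x0, #384]\n"
    "  ldp q26, q27, [x0, #416]\n"
    "  ldp q28, q29, [x0, #448]\n"
    "  ldp q30, q31, [x0, #480]\n"
    "  ret\n"
);
#else
/* Not an AArch64 ELF target: stubs so the file still builds. */
void fpu_ctx_save(fpu_ctx_t *ctx) { (void)ctx; }
void fpu_ctx_restore(const fpu_ctx_t *ctx) { (void)ctx; }
#endif

/* Hypothetical wrapper: the buffer lives on the interrupt stack, so each
   level of interrupt nesting gets its own copy (528 bytes per level). */
void run_handler_with_fpu_saved(void (*handler)(void))
{
    fpu_ctx_t ctx;
    fpu_ctx_save(&ctx);
    handler();
    fpu_ctx_restore(&ctx);
}

volatile float g_filtered;  /* hypothetical state updated by the handler */

void example_fp_handler(void)
{
    g_filtered = 0.5f * g_filtered + 1.5f;
}
```

The compiler is free to use floating point registers inside the C wrapper itself, so doing the same save and restore in the port's assembly IRQ entry and exit path is the more robust place for it. The per-level buffer is also where the larger stack requirement comes from.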

I’ve added code to save and restore the FP regs, and that resolved my issues. But my last question remains: for the ARM_CA53_64_BIT port, I don’t see any code that uses the configUSE_TASK_FPU_SUPPORT macro. Can you point me to it?

Thanks,

FreeRTOS folks: I’m reviving this thread because it looks to remain unresolved and is definitely not something that we can just ignore. The remaining topics are as follows:

  1. The guide for FreeRTOS on a Cortex-A ( https://freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html ) points out that memcpy and the like will be optimized with GCC to use the FPU. It also hints that one can simply provide the definition of vApplicationFPUSafeIRQHandler instead of vApplicationIRQHandler but that doesn’t appear to actually be the case since vApplicationFPUSafeIRQHandler doesn’t appear anywhere in the FreeRTOS codebase. Likewise, the vApplicationIRQHandler is included in portASM.S, but there’s no vApplicationFPUSafeIRQHandler equivalent. Can you provide a vApplicationFPUSafeIRQHandler implementation to go along with the port and demo?
  2. configUSE_TASK_FPU_SUPPORT being set as either 1 or 2 is only referenced in the FreeRTOS code to give it a default setting of 1 in FreeRTOS.h. Is this something that we should just get used to and work around it, or is it likely to be changed (to work as described in the comments) in a future release of FreeRTOS with all its platform demos?

Thanks in advance for any insight here.

vApplicationFPUSafeIRQHandler is something the application writer is expected to provide: https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/master/portable/GCC/ARM_CA9/port.c#L167

It would definitely be good if that was not the case. I’m not sure why it was left that way but suspect it is because the location of the FPU buffer isn’t known until the application writer provides it - or because there is no one way of doing it. In any case, if somebody wants to create a PR that demonstrates this it would be good! I presume the code can replicate that used to switch context when configUSE_TASK_FPU_SUPPORT is set to 2.

You should be able to set this to 2 in FreeRTOSConfig.h - but I think that only saves the floating point registers in the context switch, not on interrupt entry.

Thanks for the prompt response.

Understood, and fair point. But grep’ing for vApplicationFPUSafeIRQHandler in the source shows that it’s only referenced in the GCC/ARM_CA9 port, not in the GCC/ARM_CA53_64_BIT port. The Cortex-A web page ( https://freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html ) appears to really only apply to the GCC/ARM_CA9 port. The text that’s sort of in a title position on that page says it’s specific to the A9, but the link itself lacks the “9”, so it’s sort of confusing.

Grep’ing through the source in FreeRTOS 10.4.1 shows that configUSE_TASK_FPU_SUPPORT is present in the GCC/ARM_CA9 port.c, but not in GCC/ARM_CA53_64_BIT. So, as the OP stated, the port that works with the Zynq Ultrascale MPSoC does NOT include the configUSE_TASK_FPU_SUPPORT option. I believe the OP was able to work around this based on the back and forth discussion above, but it leaves the GCC/ARM_CA53_64_BIT/port.c in a state where this interrupt and FPU stack problem is still present and not handled in the same way as the CA9 port.

In the end, I guess the point is that the ports for the CA9 and CA53, and the corresponding demos for the Zynq A9 and Zynq MPSoC A53 platforms, are not actually similar in these ways, even though the web page ( https://freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html ) sort of implies that it’s valid for the Cortex A processors as a whole, with the exception of the three locations on the page where it specifically calls out the A9.

The final question, then, is whether or not the port and demo for the A53 will be updated to match what’s in the port and demo for the A9. If not, no worries, but it doesn’t seem like this difference is intentional. Thanks, again, for the interaction.

Reminding myself how this is working…

When context switching, I see the A53 port also tests to see if there is an FPU context https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/master/portable/GCC/ARM_CA53_64_BIT/portASM.S#L88 and if so saves the FPU registers (a few lines down from that link). All tasks start without a floating point context https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/master/portable/GCC/ARM_CA53_64_BIT/port.c#L260

The IRQ handler says it is saving the volatile registers, but does not save any floating point registers at that time, so maybe none of the floating point registers are considered volatile? Would need to check the compiler ABI: https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/master/portable/GCC/ARM_CA53_64_BIT/portASM.S#L278

In all cases, this page https://freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html is only partially relevant to the 64-bit port and, as it is referenced from the 64-bit instructions (https://freertos.org/RTOS-Xilinx-UltraScale_MPSoC_64-bit.html), that needs to be made clearer.

What is the main interest here - making the documentation clearer, or providing an IRQ handler that saves and restores the volatile floating point registers?

Thanks, again, for the fast follow up. To be selfish, I only want to see the IRQ handler that saves/restores the FPU, because of all of the API functions (e.g., queue accesses) that use memcpy or similar, which end up making use of the FPU. Intentionally using a slower memcpy seems like a poor choice for common cases, so having the IRQ handler ensure that the FPU gets saved/restored if necessary seems like the wisest thing to do. Thanks, one more time, for the fast and in-depth investigation.

Ok.

As an aside, as far as the kernel’s use of memcpy() goes, creating your own implementation that just does a byte for byte copy is unlikely to be much slower, if indeed slower at all. That is because the kernel only moves small amounts of memory at a time, whereas the library version of memcpy() is optimised to move large amounts of memory around at a time. Therefore the library version carries a performance penalty at the start and end of the copy operation as it works to align the start address to something that enables it to start using a more efficient copy method.
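A minimal sketch of such a byte-for-byte copy, for illustration. The name prvByteCopy is hypothetical, and how it would be wired in place of the library memcpy() is not shown. The volatile-qualified pointers are a choice made here so that the compiler performs each byte access as written, rather than widening the loop into vector (FPU/NEON) instructions or replacing it with a call to the library memcpy():

```c
#include <stddef.h>

/* Copy xBytes bytes from pvSrc to pvDest one byte at a time.
   The regions are assumed not to overlap, as with memcpy(). */
void *prvByteCopy(void *pvDest, const void *pvSrc, size_t xBytes)
{
    volatile unsigned char *pucDest = pvDest;
    const volatile unsigned char *pucSrc = pvSrc;

    while (xBytes > 0) {
        *pucDest = *pucSrc;
        pucDest++;
        pucSrc++;
        xBytes--;
    }
    return pvDest;
}
```

It is still worth checking the generated disassembly for floating point or vector instructions, since that is the property the copy is meant to have.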

Understood. Thanks for that clarification. By “common cases” I meant “scenarios where application developers are commonly going to use memcpy()/memcmp()/etc”, which has a built in assumption that it’s to act on large blocks of data. That assumption is entirely just an assumption and perhaps is wrong.

In a related vein, I stumbled on the thread “GCC9 compiled code damage function local variable (related to VFP/FPU support)”, which makes me think that it’s not really as simple as only the tasks that use FPU instructions competing with ISRs that also use FPU instructions.