Wrong Register value gets saved after Context Switch/Interrupt

Hello community,

sometimes I get the wrong value pushed to the LUA stack.
This value was in the register R3 and gets pushed via lua_pushnumber().
But in the memory the wrong value is seen afterwards.

I’m working on an older project.
It’s running on a Xilinx Zynq 7000 Cortex A9 with the 2017.1 version of Xilinx SDK (GCC).
The RTOS is version 9.0.
There are around 15 tasks running.
Most of them are waiting for a message from the bus in the corresponding queue.

While the LUA task is actively running, it can happen that a task from a high performance bus interrupts the LUA task, and after returning from it, LUA writes the wrong value. This is most notable when the floor() function tries to floor a value. For integers the value just gets changed to 0, but for floats it throws an exception because “the value given is not a number”.

I changed the config and excluded all defines which are just for statistics and the like.
I activated FPU for all tasks.
I compared the current demo with my version and came across one line that differs in the implementation of vApplicationIRQHandler(), which is this:
/* Re-enable interrupts. */
__asm ( "cpsie i" );
But even after adding this nothing changed.

I painted the stack boundaries and monitored them, increased heap and stack size of the whole project and RTOS tasks to no avail.
The interrupts are only using RTOS functions with the ending fromISR and no memmove()s/memcpy()s.
The tasks use queues but no malloc()s and some use memmove()s.
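
For readers unfamiliar with the stack painting mentioned above, here is a minimal host-side sketch of the idea. The names paint_stack(), unused_stack_bytes() and PAINT_BYTE are hypothetical and not taken from the project; the sketch assumes a stack that grows downward, so the lowest addresses are used last. For FreeRTOS tasks the kernel offers uxTaskGetStackHighWaterMark() for the same purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fill pattern; any value unlikely to occur naturally works. */
#define PAINT_BYTE 0xA5u

/* Fill the whole stack region with the pattern before the task starts. */
void paint_stack(uint8_t *base, size_t size)
{
    assert(base != NULL);
    memset(base, PAINT_BYTE, size);
}

/* Count the bytes at the low end that still hold the pattern.
 * Assumes a downward-growing stack, so these bytes were never used. */
size_t unused_stack_bytes(const uint8_t *base, size_t size)
{
    size_t n = 0;

    assert(base != NULL);
    while (n < size && base[n] == PAINT_BYTE) {
        n++;
    }
    return n;
}
```

A result close to 0 would mean the stack was (nearly) exhausted. A healthy margin, as reported above, points away from a plain stack overflow.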

The priority of the task which interrupts LUA is 5 and LUA is 1.
Changing the priorities or wrapping the call in a critical section did help, but giving the LUA script priority over the task which handles the bus messages doesn’t seem right to me, since the bus task has a harder real-time requirement than LUA.
Plastering the LUA library with critical sections seems stupid as well.

With an oscilloscope I could see that it’s always the same behaviour.
The LUA task gets switched in after some other task (either the bus task or even the IDLE task).
LUA reads the value in the array which holds the bus value.
The high performance bus task kicks in for a couple of µs (mostly checking the queue and seeing that the message is not of importance).
Execution returns to LUA, and LUA pushes a wrong value to its stack and then either crashes or sends 0.
Sometimes the system is running 45 min without error, sometimes 1 minute.

My guess would be that something messes up the pop of the correct register values,
since R3 sometimes holds the value 0606060 from R6 afterwards.

Currently it is possible to force the error by using a breakpoint at these 2 lines of code and running them via play instead of stepping. Stopping 1 line above does not force the error to occur (math_floor() of mathlib.c):
lua_Number d = l_mathop(floor)(luaL_checknumber(L, 1));
pushnumint(L, d);

Has anyone experienced something like this before? Could you point me to where to look, or share how you would debug something like this?

Is LUA a FreeRTOS task here?

Which FreeRTOS port are you using? Any reason you cannot move to the latest FreeRTOS version?

Can you help me understand this? Are you saying that LUA task is interrupted in the middle of a context switch? or LUA task’s saved state is corrupted by some other task?

You mean the saved state of LUA task is corrupt and as a result, when the task is again selected to run, the state is not restored correctly? If yes, can you use data breakpoint to catch who is corrupting that?

Yes, LUA is a task only running if the user provides a script.
It’s kinda, let’s say, the cherry on top of the functionality.
It only provides easier access to manipulating how the data is shown, like applying a factor to it before sending an additional message through the bus. That’s why it’s got a low task priority.

The version of FreeRTOS is V9.0.1 and the port is portZynq7000 (the BSP is named freertos901_xilinx_v1_0). I hope this is the information you wanted; otherwise, could you specify what you’re referring to?
The reason why it’s not moving to the latest version is mostly because of priorities.
It’s “working” well enough with some error handling inside the script that an upgrade isn’t necessary, and the upgrade would need extensive testing afterwards. Since Xilinx moved on to Vitis and the libraries currently in use are obsolete or have changed, it would also take time in this regard. I can’t say anything about upgrading only the RTOS, because I don’t know the Xilinx port well enough.
It mainly comes down to this: it bugs me on a personal level and I’m doing it on the side since it’s a low-prio issue.
I just really wish to understand how this happens and fix it without these workarounds :smiley:

LUA is running in a while(TRUE) loop. In the loop there is a vTaskDelay().
But LUA gets interrupted while calculating the floor of a number.
So it’s not actively suspended by “me”.
When checking the logs, it’s always a high priority task which interrupts this. So a context switch happens, and when it returns from said high prio task back to LUA, LUA writes the wrong number into its own stack (lua_State struct). I’m not exactly sure how or what happens with the data. I can only see the aftermath, that it’s wrong.

That’s one of my problems: I don’t know how I can catch that.
When stepping through the code it doesn’t happen, only when I “let go” with play instead of steps. I can see that RTOS is correctly writing the data into the supervision stack. When I let it run to retrieve it, it’s already overwritten. Could you give me a pointer on where to set the breakpoints in the RTOS library to help me understand the inner workings? I’m not that familiar with RTOS at this level of functionality, and maybe I’m just looking in the wrong place.

Thanks!

Which file in the portable folder are you using?

I am not familiar with this port but is it possible that floating point registers are not saved as part of the context?

Can you show the code snippet which writes this number? I am trying to find out where the data that is getting corrupted is stored.

One way of doing that is declaring a variable next to the one getting corrupted. This new variable is unused and therefore must not change. You can put a data breakpoint/watchpoint on this new variable, which essentially says “break when the contents at this address change”. When the debugger stops on the data breakpoint, you have likely caught the cause of the data corruption.

It’s from the Xilinx page. It’s just the v9.0.1 we are using instead of the v10 I linked. The only difference in the significant files are the parts with “:: memory” as far as I can tell.

Since the config has FPU support set to 2, the FPU context should be saved and restored, unless I misunderstood it. And vpush d0-d31 should cover the correct registers for saving the FPU state, I think.

Sure thing! You can find the function here which returns a broken integer or here which returns an error due to the value not being a number. We are using version 5.3.4 of this. But the significant part is the return of the floor, which is the same. In the first case I get 0 instead of the correct floor, and in the second case 2.4e-32 (or something like that), which in the next line “if (n != f)” results in the error.

When I get around to testing/checking it, I’ll report back.

Thanks!

Can you double check that the version that you are using has the support for configUSE_TASK_FPU_SUPPORT = 2 - https://github.com/Xilinx/embeddedsw/blob/779a67147ef3ea40c6e24556b5ff3b79f435358c/ThirdParty/bsp/freertos10_xilinx/src/Source/portable/GCC/ARM_CA9/portmacro.h#L176

/* If configUSE_TASK_FPU_SUPPORT is set to 1 (or left undefined) then tasks are created without an FPU context and must call vPortTaskUsesFPU() to give themselves an FPU context before using any FPU instructions. If configUSE_TASK_FPU_SUPPORT is set to 2 then all tasks will have an FPU context by default. */
#if( configUSE_TASK_FPU_SUPPORT != 2 )
void vPortTaskUsesFPU( void );
#else
/* Each task has an FPU context already, so define this function away to nothing to prevent it being called accidentally. */
#define vPortTaskUsesFPU()
#endif

with

#define configUSE_TASK_FPU_SUPPORT 2

defined in FreeRTOSConfig.h, which enables this branch:

#elif( configUSE_TASK_FPU_SUPPORT == 2 )
{
    /* The task will start with a floating point context. Leave enough
    space for the registers - and ensure they are initialised to 0. */
    pxTopOfStack -= portFPU_REGISTER_WORDS;
    memset( pxTopOfStack, 0x00, portFPU_REGISTER_WORDS * sizeof( StackType_t ) );
    pxTopOfStack--;
    *pxTopOfStack = pdTRUE;
    ulPortTaskHasFPUContext = pdTRUE;
}

I attached the cfg as well.

FreeRTOSConfig.h (4.3 KB)

It is kind of hard to make a guess any further. Do you have a way to repro this corruption reliably and consistently (asking because that helps in debugging)? Drop me a DM if you’d like to setup a debug session to debug this together.

Thank you for the offer, I will take it the moment I’m completely stuck on the problem.

With your earlier suggestion I could find the place where the corruption might happen.

When I’m stopping in the debugger at this line and use single step I end up in this line.
In the assembly I see these instructions when entering the function:

push {r11,lr}
add r11,sp,#4
sub sp,sp,#16
str r0,[r11,#-16]
vstr s0,[r11,#-20]
vldr s15,[r11,#-20]

The instruction vstr is the interesting part and looks like this (good case):

vstr 0x43C80000,[0x0062E574,#-20]

With the same setup but changing single step to play I get an interrupt from RTOS and the vstr instruction suddenly looks like this (bad case):

vstr 0x06060606,[0x0062E574,#-20]

Which pushes the wrong result into the LUA state struct.

Following this, I could find the moment the float register is overwritten.
It’s the instruction BLX r1 (BLX 0x001C8398) in the FreeRTOS_IRQ_Handler. No float registers are saved prior to this instruction, so this should be the cause, I think.

But after reading a bit about BLX, I don’t understand how a branch that can switch instruction sets could change the floating-point registers. Do you have an idea or insight into what’s happening?

Edit:
After stepping through the assembly instead of the “code”, I end up in an ISR.
The trace of calls looks like this before overwriting the floating registers:

taskLuaSkript()
lua_pcallk()
luaD_pcall()
luaD_rawrunprotected()
f_call()
luaD_callnoyield()
luaV_execute()
luaD_precall()
math_floor()
FreeRTOS_IRQ_Handler()
FreeRTOS_IRQ_Handler()
isrBusMessage()
xQueueGenericSendFromISR()
prvCopyDataToQueue()
memcpy()
=> vldr d0,[r1]

After some reading on this topic it should be safe to use memcpy but some experienced problems on Zynq. In the FAQ it’s also written that this might happen with some libs.
I thought the ISR was safe since I use configUSE_TASK_FPU_SUPPORT = 2 and have optimization disabled (Optimization Level None, -O0).
Is this thought wrong? Do I need to add an additional VPUSHNE {D0-D15} to FreeRTOS_ApplicationIRQHandler, or is it best practice to add this to the ISR (isrBusMessage()) itself? As an alternative, could I also implement my own memcpy()?
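
To make the last option above concrete, here is a minimal sketch of a byte-wise copy routine. The name isr_safe_copy() is hypothetical. The volatile-qualified pointers are used on the assumption that they keep the compiler from replacing the loop with a library memcpy() call or with wide FPU/NEON loads. Whether the generated code really stays clear of the floating-point registers depends on the compiler and flags, so the disassembly should be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies len bytes one at a time.
 * The regions must not overlap. Returns dst, like memcpy(). */
void *isr_safe_copy(void *dst, const void *src, size_t len)
{
    volatile uint8_t *d = (volatile uint8_t *)dst;
    const volatile uint8_t *s = (const volatile uint8_t *)src;

    /* Debug-build sanity check on the arguments. */
    assert(dst != NULL && src != NULL);

    while (len--) {
        *d++ = *s++;
    }
    return dst;
}
```

Note that this only covers copies made in application code. In the trace above, memcpy() is called from prvCopyDataToQueue() inside the kernel, which such a helper would not change by itself.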

I see that you are using FPU registers in ISR. What is the name of your application IRQ handler? Can you make sure that it is vApplicationFPUSafeIRQHandler and NOT vApplicationIRQHandler?

Ok, this is a stupid misunderstanding by me. I now understood what exactly was meant by renaming the IRQ handler.

What confused me is the naming in the Xilinx implementation. vApplicationIRQHandler is implemented as a weak function in portASM.S and calls vApplicationFPUSafeIRQHandler. The IRQ handler which the code comments and you refer to is named FreeRTOS_ApplicationIRQHandler in portZynq7000.c and is called directly from FreeRTOS_IRQ_Handler in portASM.S. At first I thought I should rename FreeRTOS_ApplicationIRQHandler together with its occurrence in FreeRTOS_IRQ_Handler, since it would not run otherwise. That clearly didn’t work, so I renamed the weak function vApplicationIRQHandler, which also clearly did nothing. The comment was actually meant to change the execution order, so that the weak function is called first, before the implemented handler. Now FreeRTOS_IRQ_Handler calls the weak function vApplicationIRQHandler, which then calls FreeRTOS_ApplicationIRQHandler (now named vApplicationFPUSafeIRQHandler), and the FPU context is saved correctly.
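
To illustrate why the call order matters, here is a small host-side simulation. It is not the port code: the real save and restore happen in assembly in portASM.S, and every name below (sim_fpu, sim_fpu_using_isr(), sim_irq_entry_unsafe(), sim_irq_entry_fpu_safe()) is hypothetical. The array only stands in for the D0-D31 register bank.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_FPU_REGS 32

/* Simulated D0-D31 register bank (stand-in for the real VFP registers). */
uint64_t sim_fpu[SIM_FPU_REGS];

/* Stand-in for an ISR whose copy routine clobbers d0,
 * as memcpy() did in the trace above. */
void sim_fpu_using_isr(void)
{
    sim_fpu[0] = 0x0606060606060606ull;
}

/* Unsafe order: the handler runs without any FPU context being saved. */
void sim_irq_entry_unsafe(void)
{
    sim_fpu_using_isr();
}

/* Safe order: save the bank, run the handler, restore the bank. */
void sim_irq_entry_fpu_safe(void)
{
    uint64_t saved[SIM_FPU_REGS];

    assert(sizeof saved == sizeof sim_fpu);
    memcpy(saved, sim_fpu, sizeof saved);
    sim_fpu_using_isr();
    memcpy(sim_fpu, saved, sizeof saved);
}
```

With the unsafe entry, the interrupted code sees the ISR’s leftover value in d0 when it resumes; with the safe entry, it sees its own value again.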

In the latest version of the Xilinx repository it is thankfully renamed to the default names which now makes the comment meaningful/understandable again.

Thank you for your help and patience!

Glad that it worked for you!