HardFault randomly in xEventGroupWaitBits

rickou · October 4, 2024, 7:58am

OK, so in my case the priority levels should not be the problem…

I checked again when the crash happens…
The content of the event group seems to be fully erased (but not the pointer to the structure).

I suspect the problem comes from there, but I need to trap where and why it is erased…

richard-damon · October 4, 2024, 12:59pm

If the EventGroup is being overwritten, it is likely that something (likely the thing allocated just prior or after) is being used to access outside its bounds.

One trick to help catch it is to statically allocate your FreeRTOS objects (with the xxxCreateStatic functions); then you can easily see what the objects before and after are, and which one looks to have overwritten it. Either that, or it moves the overwrite to something totally different, but again, being able to look at addresses and see what objects they are part of helps locate who is not staying inside their boundaries.
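As a rough illustration of "look at addresses and see what objects they are part of", here is a minimal sketch of an address-to-object lookup table. The names `object_map`, `find_owner` and the three storage buffers are hypothetical, and plain byte arrays stand in for the `StaticEventGroup_t` / `StaticQueue_t` buffers that would be handed to the xxxCreateStatic functions on a real target.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for statically allocated kernel objects (hypothetical names and
 * sizes). On target these would be the buffers passed to the xxxCreateStatic
 * functions; byte arrays keep the sketch portable. */
static uint8_t rx_queue_storage[64];
static uint8_t event_group_storage[32];
static uint8_t tx_queue_storage[64];

typedef struct {
    const char *name;  /* label shown when an address is resolved */
    const void *base;  /* first byte of the object */
    size_t      size;  /* size of the object in bytes */
} object_region_t;

static const object_region_t object_map[] = {
    { "rx_queue",    rx_queue_storage,    sizeof rx_queue_storage },
    { "event_group", event_group_storage, sizeof event_group_storage },
    { "tx_queue",    tx_queue_storage,    sizeof tx_queue_storage },
};

/* Return the name of the object that contains addr, or NULL if none does. */
static const char *find_owner(const void *addr)
{
    uintptr_t a = (uintptr_t)addr;

    for (size_t i = 0; i < sizeof object_map / sizeof object_map[0]; i++) {
        uintptr_t start = (uintptr_t)object_map[i].base;

        if (a >= start && a - start < object_map[i].size) {
            return object_map[i].name;
        }
    }
    return NULL;
}
```

With a table like this, an address seen in the debugger (for example the destination of a suspicious write) can be resolved to an object name, and the entries just before and after the clobbered object become the first suspects.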

rickou · October 4, 2024, 1:20pm

Yes, I have identified a way to get the crash almost 100% of the time…
Defining the objects statically seems to make it hard to reproduce.
For example, what I found: if the event group is at address 0x20005FA8, I get the crash 100% of the time at the exact same code address…
With the same code, the event group address could also be 0x20005f10 or 0x20006040. With either of those addresses the code never crashes…

I know that whether the address moves earlier or later depends on whether other code allocates an object or not.
I also found that if I disconnect the Ethernet cable from my board, the event group is always placed at an address that works… but… when the address is the “bad one”, there is no Ethernet-related IRQ between the event group allocation and the time it is overwritten…

It drives me crazy…

I continue investigating… (I have started to disable all events in the code)

richard-damon · October 4, 2024, 1:32pm

That is one reason that allocating everything at the beginning can be helpful, as the relationship of all the objects will be fixed. Apparently, your program startup path is dependent on external conditions, like the presence of the Ethernet cable, which means that whatever is currently clobbering the EventGroup will at other times be clobbering something else (and maybe in a way that isn’t as obvious). Getting the behavior consistent is the first step to finding where the problem is.
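To make the point about a fixed layout concrete, here is a small model, not FreeRTOS code. It assumes a simple bump allocator (real FreeRTOS heap schemes behave differently in detail), and the sizes and function names are hypothetical. It shows how one conditional allocation at startup shifts where a later object lands, and how a fixed allocation order removes that dependency.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object sizes in bytes, chosen only for illustration. */
enum { TASK_SIZE = 256, ETH_BUFFER_SIZE = 128, EVENT_GROUP_SIZE = 32 };

/* Minimal bump allocator model: each allocation returns the current offset
 * and advances it, like a heap that is filled once at startup. */
typedef struct {
    size_t next;
} bump_heap_t;

static size_t bump_alloc(bump_heap_t *heap, size_t size)
{
    size_t offset = heap->next;

    heap->next += size;
    return offset;
}

/* Startup path that allocates the Ethernet buffer only when the link is up,
 * so the event group's offset depends on an external condition. */
static size_t event_group_offset_conditional(bool link_up)
{
    bump_heap_t heap = { 0 };

    (void)bump_alloc(&heap, TASK_SIZE);
    if (link_up) {
        (void)bump_alloc(&heap, ETH_BUFFER_SIZE);
    }
    return bump_alloc(&heap, EVENT_GROUP_SIZE);
}

/* Startup path that allocates everything up front in a fixed order, so the
 * event group always lands at the same offset. */
static size_t event_group_offset_fixed(bool link_up)
{
    bump_heap_t heap = { 0 };

    (void)link_up; /* the layout no longer depends on the link state */
    (void)bump_alloc(&heap, TASK_SIZE);
    (void)bump_alloc(&heap, ETH_BUFFER_SIZE);
    return bump_alloc(&heap, EVENT_GROUP_SIZE);
}
```

In the conditional variant the event group moves by the size of the optional allocation, which is the kind of shift that can put it in or out of the path of an unrelated stray write.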

rickou · October 4, 2024, 1:39pm

Fully agree!
But if I change too many things now, it could start working and I would never find where the bug is.
This is why, even with dynamic allocation and environment dependencies like the Ethernet cable, I tried to disable most events, so most tasks are now doing nothing…

Now my Ethernet driver never sends read packets to the IP stack… but the crash is still there 100% of the time.
I will find it! Then I will have a good beer!

Is there a way to do some tracing, at least at kernel level?

richard-damon · October 4, 2024, 1:45pm

Just remember that just because it is the presence or absence of the Ethernet cable that moves the block into a “good” or “bad” position, doesn’t mean that it is the Ethernet system that is causing the problem, just that it happens to move stuff into the bad configuration.

One technique that can be used (if your debugger supports it) is to find the address of one of the words getting clobbered, put a hardware write breakpoint there, and see what is writing over it.
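If the hardware watchpoint is unavailable or gives confusing results, a software fallback (a different technique from the one described above, and only a sketch) is to snapshot the words of interest and compare them at chosen checkpoints. All names here (`mem_watch_t`, `mem_watch_arm`, `mem_watch_check`, `WATCH_BYTES`) are hypothetical. It only narrows down between which two checkpoints the change happened; it does not identify the writer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WATCH_BYTES 16u /* hypothetical: number of bytes to monitor */

typedef struct {
    const volatile uint8_t *target;       /* start of the watched region */
    uint8_t  snapshot[WATCH_BYTES];       /* last known-good contents */
    uint32_t first_bad_checkpoint;        /* 0 means no change seen yet */
} mem_watch_t;

/* Record the current contents of the region as the known-good state. */
static void mem_watch_arm(mem_watch_t *w, const void *target)
{
    w->target = (const volatile uint8_t *)target;
    for (size_t i = 0; i < WATCH_BYTES; i++) {
        w->snapshot[i] = w->target[i];
    }
    w->first_bad_checkpoint = 0u;
}

/* Compare the region against the snapshot. checkpoint_id must be nonzero and
 * unique per call site. Returns true while the region is unchanged; the first
 * checkpoint that sees a difference is remembered. */
static bool mem_watch_check(mem_watch_t *w, uint32_t checkpoint_id)
{
    for (size_t i = 0; i < WATCH_BYTES; i++) {
        if (w->target[i] != w->snapshot[i]) {
            if (w->first_bad_checkpoint == 0u) {
                w->first_bad_checkpoint = checkpoint_id;
            }
            return false;
        }
    }
    return true;
}
```

Checks could be placed at task entry points or behind a kernel trace hook such as `traceTASK_SWITCHED_IN()`. Since an event group changes legitimately when bits are set or tasks wait on it, the watch would need to be re-armed after every expected change, or limited to words that should stay constant.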

rickou · October 4, 2024, 1:59pm

Yes, the debugger supports breakpoints on data change…
But it does not seem to work as I expect, because it stops at an instruction that clears an MCU register…
And if I single-step in assembly around this instruction: before it, the event group is OK; just after it, the event group is corrupted… but again, the instruction itself does not touch memory…
Just for “fun”, here it is…

Before the instruction, r3=0, S15=0x7270E00 and the event group is OK.
If I do a single assembly step

the event group is erased… and S15=0, as the instruction should do…

So I suspect the assembly stepping is not working as expected and runs more code… that’s why I think it could be linked to some IRQ…
I think the kernel can’t switch context at this point…

hs2 · October 4, 2024, 2:20pm

It seems you’re using the FPU, at least in the generated code.
So you’re using the ARM_CM4F port, which supports use of the FPU? Just to be sure…

rickou · October 4, 2024, 2:22pm

I’m using the ARM_CM33_NTZ port.

If I’m right, vmov is an FPU instruction… The FPU is enabled for the build, but my FreeRTOS config has configENABLE_FPU set to 0.

Could the problem be linked to this?

In fact, I just put a breakpoint in PendSV_Handler, and this handler is called a few instructions before the corruption…
(I don’t know what it does.)

hs2 · October 4, 2024, 2:25pm

That’s a license to crash, because the (additional) FPU register set is not saved/restored on task switch.
Anything can and will happen then.

rickou · October 4, 2024, 2:43pm

I understand and fully agree.

But during my test, there is no context switch during the time the event group gets corrupted…

But to be sure there is no other hidden issue, I am trying to understand where and why it happens…

hs2 · October 4, 2024, 2:53pm

But I think you should #define configENABLE_FPU 1 to ensure correct, consistent behavior.
Also, on a HW interrupt the FPU regs are saved/restored (or not) by the CPU.
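A minimal sketch of what that could look like in FreeRTOSConfig.h, with a build-time guard added as a suggestion that is not part of the thread. The guard assumes a compiler that follows the Arm C Language Extensions (such as GCC or Clang), which predefine `__ARM_FP` when code is generated for hardware floating point.

```c
/* FreeRTOSConfig.h (fragment) */

/* Ask the port to save and restore the FPU registers on a task switch. */
#define configENABLE_FPU 1

/* Assumed guard: fail the build if the compiler is generating hardware FP
 * code while the port is configured not to preserve the FPU registers. */
#if defined(__ARM_FP) && (configENABLE_FPU != 1)
#error "Hardware FP code generation is enabled but configENABLE_FPU is not 1"
#endif
```

The guard turns the mismatch discussed here (FPU enabled for the build, configENABLE_FPU set to 0) into a compile error instead of a runtime surprise.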

rickou · October 4, 2024, 3:24pm

For sure!

So, I identified a problem that comes from my bad configuration.
On return from a context switch, lr<4>=0 says the FPU registers should be restored, but on entry lr<4>=1 says they were not saved… 🤔
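For readers unfamiliar with the lr<4> notation: on exception entry the link register holds an EXC_RETURN value, and bit 4 of it tells whether the stacked frame includes floating-point state (0) or not (1). The helpers below are a small sketch for decoding that bit; the function and macro names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 4 of EXC_RETURN: 1 = standard frame (no FP state stacked),
 * 0 = extended frame (FP state stacked). */
#define EXC_RETURN_FTYPE_MASK (1u << 4)

/* True when the EXC_RETURN value describes a frame that includes FP state. */
static bool exc_return_uses_fp_frame(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_FTYPE_MASK) == 0u;
}

/* True when the value seen on handler entry and the value used on handler
 * exit disagree about whether FP state is part of the frame. */
static bool exc_return_frame_mismatch(uint32_t lr_at_entry, uint32_t lr_at_exit)
{
    return exc_return_uses_fp_frame(lr_at_entry) !=
           exc_return_uses_fp_frame(lr_at_exit);
}
```

As a usage example, `exc_return_frame_mismatch(0xFFFFFFBCu, 0xFFFFFFACu)` is true, because the two values differ in bit 4. That is the situation described above: a frame entered without FP state is exited as if FP state had been stacked.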

BUT I still don’t know where or what corrupts the memory…
For me, the problem is not fully resolved… but now it is time for the weekend!

The first run with configENABLE_FPU set to 1 works… but for how long…

Thank you guys for your tips and help!