M4 port stack corruption

Hello;
I’ve encountered a really perplexing bug, and I’m hoping that someone has the cure. It’s somehow related to the FPU and the FreeRTOS-M4 port.

The application has multiple tasks which use the FPU. These are synchronized to an external, 250ms period interrupt from the AFE (bq76920 ALERT) which uses a counting semaphore and xSemaphoreGiveFromISR() to trigger a task that then reads an SMBus-based AFE. This task then triggers another task via xQueueSend, which reads the cpu’s ADC results for additional data, as well as confirming the AFE’s data.

All this works as anticipated…that is, until the AFE begins toggling its ALERT signal at 500 Hz, 100 times, due to a detected short circuit fault. When this occurs, the code crashes with a Usage fault INVSTATE…and this is repeatable.

My ETM trace shows that the AFE monitor task’s context is being clobbered somehow, and my theory is that it has something to do with nested interrupts, Lazy stacking and context switching. Again, this only occurs when the interrupt rate switches from 4 Hz to 500 Hz. The trouble is, I’m unable to track down where it’s happening. EWARM’s Fault Exception Viewer lists the offending instruction, but it’s different from the last instruction in the trace!

But as per the “Using Cortex-M3/M4/M7 Fault Exceptions” :
INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has attempted to execute an instruction that makes illegal use of the Execution Program Status Register (EPSR). When this bit is set, the PC value stacked for the exception return points to the instruction that attempted the illegal use of the EPSR. Potential reasons: a) Loading a branch target address to PC with LSB=0. b) Stacked PSR corrupted during exception or interrupt handling. c) Vector table contains a vector address with LSB=0.
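
For anyone reading along, here is a minimal sketch of how a fault handler could check reason (b) from the list above. It assumes the handler has already captured the CFSR value (read from address 0xE000ED28) and a pointer to the stacked exception frame. The names `stacked_frame_t`, `invstate_hint_t` and `classify_invstate` are made up for illustration and are not part of FreeRTOS or CMSIS.

```c
#include <stdint.h>

/* The UFSR occupies the upper 16 bits of the CFSR, so UFSR bit 1
 * (INVSTATE) is CFSR bit 17. */
#define CFSR_INVSTATE_BIT  (UINT32_C(1) << 17)
/* Thumb state bit (T) in the stacked xPSR. */
#define XPSR_T_BIT         (UINT32_C(1) << 24)

/* Basic exception frame, in the order the core pushes it on entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} stacked_frame_t;

/* Hypothetical result type for this sketch. */
typedef enum {
    INVSTATE_NOT_SET,      /* the fault was something other than INVSTATE */
    INVSTATE_T_BIT_CLEAR,  /* stacked xPSR has T = 0: frame looks corrupted */
    INVSTATE_OTHER         /* INVSTATE set, but the stacked xPSR still has T = 1 */
} invstate_hint_t;

invstate_hint_t classify_invstate(uint32_t cfsr, const stacked_frame_t *frame)
{
    if ((cfsr & CFSR_INVSTATE_BIT) == 0U) {
        return INVSTATE_NOT_SET;
    }
    /* Cortex-M only executes Thumb code, so a stacked xPSR with T = 0
     * cannot be resumed and suggests the saved context was overwritten. */
    if ((frame->xpsr & XPSR_T_BIT) == 0U) {
        return INVSTATE_T_BIT_CLEAR;
    }
    return INVSTATE_OTHER;
}
```

If this reports `INVSTATE_T_BIT_CLEAR` for the frame of the task that was being restored, that would point at a clobbered saved context rather than at a bad branch in the running code.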

And this is what appears to be happening, since the last PC address stacked is invalid, which immediately triggers this fault.

Any feedback would be greatly appreciated.
Thanks

Oh, and I forgot to mention that when I revert to the M3 port (and don’t use the FPU), the bug doesn’t appear. This is why I’m focused on the Lazy stacking and nested interrupts theory.

Here are a few ideas to start off with:

  1. Are you sure your interrupt handler can complete in the shorter time available when the rate increases to 500 Hz? (I would have thought so, but it is worth asking.)

  2. The Cortex-M ports re-use the stack that was allocated to the main() function as the interrupt stack once the scheduler has been started. That means the size of the interrupt stack is set by your project’s linker options, so the kernel has no [easy] way of knowing if it overflows. If your faster interrupt rate is resulting in more nested interrupts, then it is possible you are using more interrupt stack, resulting in overflows that are not present with the lower interrupt frequency.

  3. Also read through the usual suspects here: https://www.freertos.org/FAQHelp.html (it sounds like you already have, but I point it out all the same). In particular, make sure you are using a recent version of FreeRTOS with configASSERT() defined, as the newer the version, the more assert() calls there are to check that the software configuration matches the hardware configuration.

  4. Ensure the compiler options correctly describe the floating point unit present on your hardware - they are not all the same.

Thanks for your reply. Yes, I should have also noted that I did confirm that this is NOT a stack overflow…at least it doesn’t appear to be…neither of a process stack nor of the main stack…The handler mode stack (main) is 512 bytes (0x200), and I filled it with 0x5AA5. Even when the flurry of interrupts occurs, only about 20% of the stack is used. The process stacks are also oversized just in case, and I do enable configASSERT() (as a BKPT)…and I’ve confirmed that it detects errors.
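
For reference, the fill-pattern check described above can be sketched roughly as follows. This is an illustration only: `stack_bytes_used` is a made-up helper, and it assumes the stack was painted with the 16-bit pattern 0x5AA5 before use and that `stack_base` is the lowest address of the region (Cortex-M stacks grow downwards, so the untouched part sits at the low end).

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit fill pattern mentioned in the post. */
#define STACK_FILL_PATTERN 0x5AA5U

/* Returns the number of bytes that no longer hold the fill pattern,
 * counted from the first overwritten halfword up to the top of the stack.
 * stack_base is the LOWEST address of the stack region. */
size_t stack_bytes_used(const uint16_t *stack_base, size_t size_bytes)
{
    size_t halfwords = size_bytes / sizeof(uint16_t);
    size_t untouched = 0;

    /* Walk up from the bottom until the pattern has been overwritten. */
    while (untouched < halfwords && stack_base[untouched] == STACK_FILL_PATTERN) {
        untouched++;
    }
    return (halfwords - untouched) * sizeof(uint16_t);
}
```

Note that this kind of check only measures how deep the stack has been pushed. It says nothing about a saved context being overwritten by some other route, such as a stray pointer write, so a healthy high-water mark does not rule out corruption.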

I see no means of uploading screen captures, or I’d share my IDE when this occurs.

But again, while I cannot assert with metaphysical certitude that this is related to the OS and nested, OS-aware ISRs, I’m flummoxed that this doesn’t happen when using the M3 port. But what’s worse is that, since it seems clear that no other reports of this behavior have come to light, it must be a bug in my code…which causes me great annoyance and displeasure! ;-D, since I cannot run it to ground.

In the meantime, I’ll triple check the stack sizes (I’ll double them).

To paste an image, drag and drop the image file into the text box used to write your post.

This may be a bit off, but I seem to remember the M4F having issues with the use of the FPU inside an ISR (at least by default), and that there were some cases where, even without actually using floating point, the compiler would use the FPU for data movement.

That’s interesting because when I first encountered a similar issue last year, it was the idle task’s context that was being clobbered, and I clearly saw in the trace that the EXC_RETURN value indicated it was restoring the FPU registers! I reported this to ST and IAR but neither was able to help. At that time I just gave up and moved to the M3 port. But having a similar bug occur again, I decided to bring it to this board in hopes someone else had seen and solved this issue. I’ll check the disassembly in the ISR, which should not be using the FPU. It merely clears the pending flag and posts the semaphore…
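
As a side note on reading EXC_RETURN values in a trace, here is a small decoding sketch. The bit positions are the ARMv7-M ones for a core with the FP extension; the function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXC_RETURN_SPSEL_BIT  (UINT32_C(1) << 2)  /* 1 = frame is on the PSP */
#define EXC_RETURN_MODE_BIT   (UINT32_C(1) << 3)  /* 1 = return to Thread mode */
#define EXC_RETURN_FTYPE_BIT  (UINT32_C(1) << 4)  /* 0 = extended (FPU) frame */

/* True when the exception return will unstack S0-S15 and FPSCR as well. */
bool exc_return_has_fpu_frame(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_FTYPE_BIT) == 0U;
}

/* True when the frame is unstacked from the process stack (task stack). */
bool exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_SPSEL_BIT) != 0U;
}

/* True when execution resumes in Thread mode rather than Handler mode. */
bool exc_return_to_thread(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_MODE_BIT) != 0U;
}

/* Frame size in bytes: 8 words basic, 26 words with the FP extension
 * (the basic frame plus S0-S15, FPSCR and one reserved word). */
uint32_t exc_return_frame_bytes(uint32_t exc_return)
{
    return exc_return_has_fpu_frame(exc_return) ? 26U * 4U : 8U * 4U;
}
```

For example, 0xFFFFFFED decodes as a return to Thread mode on the PSP with an extended frame, while 0xFFFFFFFD is the same return with a basic frame. A task that never touched the FPU but is resumed with the extended-frame value would be unstacking 104 bytes instead of 32.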

I checked through each ISR and found no use of the FPU. I doubled the stack size of the tasks using the FPU…no change.

Then on a whim, I increased the semaphore timeout to one second (from 250ms)…and now it doesn’t crash…at least not for the past several minutes.

This means my Lazy Stacking and nested interrupt theory must be hogwash, and points me back to my code as the culprit. Evidently, returning after a timeout results in somehow corrupting the task’s context…

Oh well, at least I’ve got a fix (or at least a band-aid ;-D)

Thanks for your help!
/*
 * Wait for the current I2C transaction to complete.
 * Timeout was originally 250ms; now increased to one second.
 */
OK = xSemaphoreTake(SMBusCallBackSemaphore, OS_ONE_SEC);

if (OK == FALSE)
    ++numSMBusTouts;

If the crash only happens on a timeout, could it be that some code after the take assumes the take succeeded, and thus you are having a resource conflict?

I tend to write all semaphore takes as

if (xSemaphoreTake(semi, time)) {
    // Do the stuff that needed the semaphore
}

so that if I get a timeout (and I can add an else if I need to do something in the timeout case), I don’t just blindly assume success. I do this even if the timeout value is portMAX_DELAY, as even in that case you can get a timeout return in special cases.
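
Applied to a driver wait, the same idea might look something like the sketch below. To keep it compilable off-target, the semaphore take is passed in as a callback; on the real system that callback would presumably just wrap xSemaphoreTake(). Every name here (`take_fn_t`, `smbus_status_t`, `smbus_wait_done`, `num_smbus_timeouts`, and the two stand-in callbacks) is hypothetical and not taken from the original code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper type: returns true if the semaphore was obtained. */
typedef bool (*take_fn_t)(uint32_t timeout_ticks);

typedef enum { SMBUS_OK, SMBUS_TIMEOUT } smbus_status_t;

/* Diagnostic counter, similar in spirit to numSMBusTouts in the post. */
uint32_t num_smbus_timeouts;

/* Wait for the transfer-complete signal. On a timeout, report it to the
 * caller instead of falling through into code that assumes the
 * transaction finished. */
smbus_status_t smbus_wait_done(take_fn_t take, uint32_t timeout_ticks)
{
    if (!take(timeout_ticks)) {
        ++num_smbus_timeouts;
        return SMBUS_TIMEOUT;  /* caller must not touch the transfer buffers */
    }
    return SMBUS_OK;
}

/* Stand-in callbacks for trying the sketch on a host machine. */
bool take_always_succeeds(uint32_t timeout_ticks) { (void)timeout_ticks; return true; }
bool take_always_times_out(uint32_t timeout_ticks) { (void)timeout_ticks; return false; }
```

The point of returning a status is that the caller can abort or restart the transaction on `SMBUS_TIMEOUT`, rather than carrying on while the driver's interrupt may still be working on the old transfer.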

That appears to be the case. During the communication blackout with the fuel gauge (while it’s busy dealing with the flurry of Alerts from the AFE), the SMBus driver is looping, waiting for an ACK. After increasing the timeout to 1 second, I note that this comms blackout can persist for up to 700ms, after which comms are restored. So yes, timing out and blithely returning must be corrupting the context somewhere.

Thanks for your feedback!