Hardfault with corrupt/strange MSP

In the process of tracking down an errant hang, I’ve gotten myself to a point where I can reproduce a hard fault within ~5 minutes of run time, but the hard fault state makes no sense to me. Looking for some help in understanding what I am seeing…

The project is based on an LPC1517 (Cortex-M3) running FreeRTOS 10.0.0, using only statically allocated objects/constructs: no heap, malloc, or free.

I currently have 5 threads:
Idle, priority 0
timer, priority 2
main loop, priority 1
CAN rx, priority 1
CAN tx, priority 1

There are 3 queues:
CAN rx,
main event,
CAN tx

The CAN rx thread blocks on the CAN rx queue waiting for CAN frames and either posts them directly into the main event Q or assembles multi frame requests then posts into the main event Q. The main loop blocks on main event Q, handles the event request then posts one or more reply frames into the CAN tx Q. The CAN tx thread blocks on the CAN tx Q then sends each frame it pulls off of the Q. After sending a frame, the CAN tx thread blocks on a direct task notification for the completion of the transmission then goes back to blocking/pulling from the CAN tx Q.

The CAN rx Q is fed by the CAN interrupt via a call to xQueueSendFromISR. The CAN tx thread notification is provided by the CAN interrupt via a call to vTaskNotifyGiveFromISR.

Only 2 interrupts are currently active. The CAN interrupt has a priority of configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY, set with the CMSIS function that shifts the bits into the correct location. The watchdog interrupt currently just resets itself and makes no calls to any FreeRTOS API.
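To make the priority shifting concrete, here is a minimal host-side sketch of the encoding that a CMSIS-style priority setter applies on a Cortex-M3. It assumes 3 implemented priority bits (check __NVIC_PRIO_BITS in your device header) and the helper names are hypothetical, not CMSIS or FreeRTOS functions.

```c
#include <stdint.h>

/* Assumption: number of implemented priority bits (__NVIC_PRIO_BITS). */
#define PRIO_BITS 3u

/* Hypothetical helper: place a logical priority in the top PRIO_BITS of
 * the 8-bit priority field, as the hardware register expects. */
uint8_t nvic_encode_priority(uint32_t logical_priority)
{
    return (uint8_t)((logical_priority << (8u - PRIO_BITS)) & 0xFFu);
}

/* On Cortex-M a numerically larger priority is a logically lower one.
 * An ISR may call FreeRTOS "FromISR" functions only if its priority is
 * numerically greater than or equal to the max syscall priority. */
int isr_may_use_freertos_api(uint32_t isr_logical_priority,
                             uint32_t max_syscall_logical_priority)
{
    return isr_logical_priority >= max_syscall_logical_priority;
}
```

With an example library priority of 5, nvic_encode_priority(5) gives 0xA0, which is the form the raw register (and configMAX_SYSCALL_INTERRUPT_PRIORITY) uses. Writing an unshifted value to the register directly is a common way to end up at a different priority than intended.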

I have code in place to set up my stacks (including idle and timer) with special values, and I can see from memory dumps that there are no stack overflows. I also have the stack checking code turned on. Additionally, I have asserts defined to halt the processor in an infinite loop should an assertion fail.
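The stack-painting check described above amounts to something like the following sketch. The function names are hypothetical, and the fill word is the 0xA5A5A5A5 magic value mentioned later in this post; the actual project may use different values per stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill pattern; 0xA5A5A5A5 is one of the magic values used here. */
#define STACK_FILL_WORD 0xA5A5A5A5u

/* Paint an entire stack buffer with the fill pattern before first use. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL_WORD;
    }
}

/* Cortex-M stacks grow downward, so stack[0] is the last word that
 * would ever be used. Count untouched words from the low end; a
 * result of 0 means the stack has been filled completely (or overrun). */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL_WORD) {
        n++;
    }
    return n;
}
```

The same scan works on any painted region, so it can be pointed at the main (ISR) stack as well as the task stacks, provided that region was painted before use.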

The system will run for ~5 minutes, accepting CAN requests and replying properly, and then a hard fault will occur. What I see in the registers does not make sense to me (the following are examples from the latest test; they may vary slightly from run to run). Inside the hard fault handler the MSP has a value of 0x20005CC and the PSP 0x20005A8. Both of these are valid stack addresses for the CAN rx thread. However, the memory for this chip runs from 0x2000000-0x2003000, so the MSP should be somewhere towards the end of that range. It is there any other time I stop execution.

I can find no place in the disassembled code where the MSP is being explicitly set other than the scheduler start code. Furthermore, the memory between where the MSP should be and where it is, is generally correct, i.e. there has been no loop or recursion that actually stacked down to this point.

In this case, I have an LR of 0xFFFFFFF1, so I got here from an interrupted exception, but I have seen the LR with 0xFFFFFFFD as well, which implies that it is not related to an exception. Even more frustrating, in one instance the LR was 0xFFFFFFFF, which makes absolutely no sense to me. Additionally, the automatically stacked values inside the hard fault handler are suspect as well: PSR = 0, LR = 0xA5A5A5A5 (one of my stack magic values), and the PC points to line 3145 in tasks.c in xTaskRemoveFromEventList.
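For reference, the LR values seen inside the handler are EXC_RETURN codes, and on a Cortex-M3 (no FPU) only three are valid. A small host-side sketch of how they and the stacked frame can be interpreted follows; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    EXC_RET_HANDLER_MSP, /* 0xFFFFFFF1: fault interrupted another handler */
    EXC_RET_THREAD_MSP,  /* 0xFFFFFFF9: fault in thread mode using MSP    */
    EXC_RET_THREAD_PSP,  /* 0xFFFFFFFD: fault in thread mode using PSP    */
    EXC_RET_INVALID      /* anything else, including 0xFFFFFFFF           */
} exc_return_kind;

exc_return_kind decode_exc_return(uint32_t lr)
{
    switch (lr) {
    case 0xFFFFFFF1u: return EXC_RET_HANDLER_MSP;
    case 0xFFFFFFF9u: return EXC_RET_THREAD_MSP;
    case 0xFFFFFFFDu: return EXC_RET_THREAD_PSP;
    default:          return EXC_RET_INVALID;
    }
}

/* The eight words the hardware pushes on exception entry, lowest
 * address first. The frame sits on the PSP only for THREAD_PSP. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exception_frame;

/* The Thumb bit (bit 24) of a stacked xPSR must be set on Cortex-M,
 * so a stacked PSR of 0 cannot be a genuine hardware-stacked value. */
int xpsr_is_plausible(uint32_t xpsr)
{
    return (xpsr & (1u << 24)) != 0;
}
```

By this reading, 0xFFFFFFFF is not a valid EXC_RETURN at all (it is the reset value of LR), and a stacked PSR of 0 alongside a fill-pattern LR suggests the "frame" being examined is not one the hardware actually pushed.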

I’m sure I’ve left out useful info, but I’ve been staring at this for 2 weeks…

I guess the ask here is:

  1. Does the general approach sound correct? Is there something missing from a FreeRTOS standpoint that could be causing this?
  2. Anyone with lots of experience at hard faults please chime in and help me understand how I can be seeing the values I see. Is there some way to corrupt the MSP without explicit calls to set it or without validly stacking down to that value?
  3. Any other helpful suggestions to get past this…

  1. I think the app design sounds good. Nothing stands out to me as problematic from a FreeRTOS standpoint.
  2. Since the MSP is manipulated any time it’s the active stack pointer, any kind of context corruption or main stack corruption could cause the MSP to receive an invalid value.
  3. Your description tells me you have already explored the usual suspects. Here are some unusual suspects:
  • Variables on the main stack, created prior to starting the scheduler, that are intended to survive after starting the scheduler. They don’t.
  • Incorrect installation of FreeRTOS vectors. See the #define’s in item #1 here.
  • Main stack overflow. FreeRTOS does not check the main stack for overflow, even if you turn on stack checking. From your description it sounds like you have task stacks pretty well covered, but I couldn’t tell if you are also checking the main stack (the ISR stack) for overflow.
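
On the vector-installation point: with a CMSIS-style startup file, the usual approach is to map the FreeRTOS port handlers onto the CMSIS handler names in FreeRTOSConfig.h. A typical fragment looks like the following. It assumes your startup file uses the standard CMSIS names, so check the vector table in your own startup code.

```c
/* FreeRTOSConfig.h fragment: map the FreeRTOS Cortex-M port handlers
 * onto the handler names used by a CMSIS-style vector table.
 * Assumption: the startup file names them SVC_Handler, PendSV_Handler
 * and SysTick_Handler. */
#define vPortSVCHandler     SVC_Handler
#define xPortPendSVHandler  PendSV_Handler
#define xPortSysTickHandler SysTick_Handler
```

If the names don't match, the vector table keeps whatever default handlers the startup file provides instead of the FreeRTOS ones.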

Isolated this to a smaller project and posted that up in a new thread here: https://forums.freertos.org/t/xqueuereceive-failing-with-corruption/11022