STM32F407 stack corruption?

niklasnorin wrote on Wednesday, May 08, 2013:

I am currently having problems with what I think is stack corruption, or some configuration error, while running FreeRTOS on an STM32F407 target.

I have looked at the thread "FreeRTOS stack corruption on STM32F4 with gcc" but got no help there.

The application runs two tasks and relies on one CAN interrupt. The workflow is as follows:

1. The two tasks, network_task and app_task, are created along with two queues, raw_msg_queue and app_msg_queue. The CAN interrupt is also set up.
2. The network_task has the highest priority and starts waiting on the raw_msg_queue, indefinitely.
3. The app_task is next and starts waiting on the app_msg_queue.
4. The CAN interrupt then triggers because of an external event, adding a CAN message to the raw_msg_queue.
5. The network_task wakes up, processes the message, adds the processed message to the app_msg_queue, and then goes back to waiting on the raw_msg_queue.
6. The app_task wakes up and I get a hard fault.

The thing is that I have wrapped the calls that app_task makes to xQueueReceive in two layers, for end-user convenience and portability. The full call chain from app_task is network_receive(…) -> os_queue_receive(…) -> xQueueReceive(…). The call itself works, but when xQueueReceive(…) returns, execution only gets back as far as os_queue_receive(…); the next return jumps to a seemingly random memory location and I get a hard fault.

The stack sizes should be adequate: both are set to 2048, and all large data structures are passed around as pointers.
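A note on units, in case it is relevant: xTaskCreate() takes its stack depth in words rather than bytes, and a stack word is 4 bytes on Cortex-M ports. Whether the 2048 above is words or bytes depends on how the (unshown) task-creation wrapper passes it on, so the sketch below is only a back-of-the-envelope check under the assumption that it is words. The function names are made up for illustration.

```c
#include <stddef.h>

/* Assumption: 32-bit stack words, as on the Cortex-M ports. */
#define STACK_WORD_BYTES 4u

/* Bytes consumed by one task stack when the depth is given in words. */
size_t stack_bytes_from_words(size_t depth_words) {
    return depth_words * STACK_WORD_BYTES;
}

/* Non-zero if the stacks of task_count tasks fit in heap_bytes.
   Ignores TCBs, queue storage, and allocator overhead, so a real
   system needs more headroom than this check suggests. */
int stacks_fit_in_heap(size_t depth_words, size_t task_count, size_t heap_bytes) {
    return stack_bytes_from_words(depth_words) * task_count <= heap_bytes;
}
```

For example, stack_bytes_from_words(2048) gives 8192 bytes per task.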

I am running my code on two STM32F407. FreeRTOS is at version 7.4.2, the latest at the time of writing.

I am really hoping that someone can help me out here!

niklasnorin wrote on Wednesday, May 08, 2013:

Can someone please remove one of my posts? SourceForge was responding really slowly, so I managed to create a double post…

travfrog wrote on Wednesday, May 08, 2013:

Are you using the FromISR APIs for the CAN interrupt?

rtel wrote on Wednesday, May 08, 2013:

Can you verify that you have looked through the notes on the following FAQ:
http://www.freertos.org/FAQHelp.html

and in particular, since you are using an STM32, ensured you have not only the interrupt priorities set correctly, but the assignment of bits between preemption priority and sub priority as noted in the red text on the following page:
http://www.freertos.org/RTOS-Cortex-M3-M4.html

Regards.

niklasnorin wrote on Thursday, May 09, 2013:

**travfrog:** Yes, I am using only the FromISR API in the interrupt.

**richardbarry:**

Yes, I’ve looked through them both, several times. It is, of course, possible that I’ve missed or misunderstood something. I call NVIC_PriorityGroupConfig( NVIC_PriorityGroup_4 ) before I call any FreeRTOS API function.

The interrupt’s priority is set to configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY (or something similar, I don’t have the code here) - 1. In any case, the numerical value 4 is used when setting the interrupt priority. Is this correct if that constant is set to 5?

The priorities of the two tasks are set to tskIDLE_PRIORITY + 1 and tskIDLE_PRIORITY + 2, and I know that the CAN interrupt is only called once, as described in the original message.

I am really at a loss here and it is really important that I get it to work. I have tried to reduce the call depth by changing the network_receive(…) -> os_queue_receive(…) -> xQueueReceive(…) chain in app_task to os_queue_receive(…) -> xQueueReceive(…), simply by copying the content. This postpones the crash until os_queue_receive(…) is called the second time from app_task (it is called in a never-ending loop).

I have also implemented the hard fault handler code that you suggested, so I can read the program counter at the hard fault. It only points to some data location, and setting a breakpoint there does not help since it is never reached for some reason.
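For reference, on exception entry a Cortex-M core pushes eight words (r0 to r3, r12, lr, pc, xPSR) onto the active stack, and the usual hard fault handler simply reads them back. The sketch below decodes such a frame and checks whether a stacked address lies inside a given region. The struct, the function names, and the address range used in the example are illustrative assumptions, not code from this application.

```c
#include <stdint.h>

/* The eight words stacked by the core on exception entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} FaultFrame;

/* Copy the stacked words into named fields. 'stacked' is the value the
   stack pointer (MSP or PSP) had when the fault handler was entered. */
FaultFrame decode_fault_frame(const uint32_t *stacked) {
    FaultFrame f;
    f.r0  = stacked[0];
    f.r1  = stacked[1];
    f.r2  = stacked[2];
    f.r3  = stacked[3];
    f.r12 = stacked[4];
    f.lr  = stacked[5];
    f.pc  = stacked[6];
    f.psr = stacked[7];
    return f;
}

/* Non-zero if addr lies in [start, end). Hypothetical helper for asking
   "does the faulting pc point into code at all?". */
int address_in_region(uint32_t addr, uint32_t start, uint32_t end) {
    return addr >= start && addr < end;
}
```

On an STM32F407 the main flash starts at 0x08000000 and SRAM at 0x20000000, so a stacked pc that falls in the SRAM range suggests the return address itself was overwritten, rather than that real code at that address faulted.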

davedoors wrote on Thursday, May 09, 2013:

> The interrupt’s priority is set to configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY (or something similar, I don’t have the code here) - 1. In any case, the numerical value 4 is used when setting the interrupt priority. Is this correct if that constant is set to 5?

If configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY is set to 5 and your interrupt uses the FreeRTOS API, then you must set the interrupt priority to a numerical value of 5 or higher. So a priority of 5 is ok, a priority of 6 is ok, but a priority of 4 is not ok because the folks at ARM in their wisdom say a priority of 4 is higher than a priority of 5.
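As a sketch of the rule just described (not FreeRTOS source): on Cortex-M a numerically lower priority value is more urgent, and only handlers whose numeric priority is at or above the max syscall value may call the FromISR functions. The code assumes 4 implemented priority bits, which is what the STM32F4 provides; the function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 4 implemented priority bits, as on the STM32F4. */
#define PRIO_BITS 4u

/* The hardware keeps the priority in the top bits of an 8-bit field, so a
   library-style value (0..15) is shifted up to get the raw register value. */
uint8_t raw_priority(uint8_t library_priority) {
    return (uint8_t)(library_priority << (8u - PRIO_BITS));
}

/* Non-zero if a handler at irq_priority may call FromISR API functions.
   Numerically equal or larger means logically equal or lower urgency. */
int may_call_from_isr_api(uint8_t irq_priority, uint8_t max_syscall_priority) {
    return irq_priority >= max_syscall_priority;
}
```

The shifted form only matters when writing the NVIC priority registers directly; CMSIS functions such as NVIC_SetPriority() take the unshifted value.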

Are you using the stack overflow detection?

niklasnorin wrote on Monday, May 13, 2013:

> So a priority of 5 is ok, a priority of 6 is ok, but a priority of 4 is not ok because the folks at ARM in their wisdom say a priority of 4 is higher than a priority of 5.

Thanks for explaining that. It is now corrected, but I have the same problems.

> Are you using the stack overflow detection?

Yes, and I’m not getting any stack overflows.
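Worth keeping in mind when reading this answer: the overflow check only inspects the far end of each task stack for a known fill pattern at context-switch time, so a stray write that lands elsewhere in the stack goes unnoticed. The sketch below is a simplified host-side model of that idea, not the kernel's code; the guard length and the function names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill pattern written over a task stack when the task is created. */
#define FILL_BYTE 0xA5u

/* Paint the whole stack with the fill pattern. */
void stack_fill(uint8_t *stack, size_t size) {
    memset(stack, FILL_BYTE, size);
}

/* The stack grows downward, so the lowest addresses are the last to be used.
   The check passes as long as the first guard_len bytes still hold the pattern. */
int guard_intact(const uint8_t *stack, size_t guard_len) {
    for (size_t i = 0; i < guard_len; i++) {
        if (stack[i] != FILL_BYTE) {
            return 0;
        }
    }
    return 1;
}
```

A write into the middle of the array leaves guard_intact() true, which is why a clean overflow check does not rule out corruption from another source.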

I still have the same problem. While debugging the program I can see in the call stack that when app_task returns, it returns from ucHeap(); maybe this is the usual behavior. When the function that contains the xQueueReceive call in app_task returns, however, it tries to return to ucHeap and ends up at an " ; <UNDEFINED> instruction: 0xff01ffff" according to the disassembly.

niklasnorin wrote on Monday, May 13, 2013:

This is really hard to debug, and I still need some help with this one.

I’m not sure the ucHeap part is very important, nor the undefined instruction, but they could be. I tried what I described in my first post and reduced the call depth to xQueueReceive in app_task. I am not sure what has changed, but this time the function returned to app_task as intended, handled the item, and went back to listening on the queue. When it calls xQueueReceive again, via the wrapper function, it works!

How come it works if I call xQueueReceive one function deep, but fails if it is called two functions deep?

I really need to wrap this call to make my code easier to use, and I’m not sure what to do here. I could force the outermost function to be inlined, I guess, but it’s not pretty and only circumvents the problem.

rtel wrote on Monday, May 13, 2013:

This might take some to-ing and fro-ing to get to the bottom of. I will ask you to post snippets to stimulate some discussion and see if other ideas come up.

Can you please start by cutting down the app_task code to the minimum that still shows the issue, then posting its code here.  To help with formatting, please replace tabs with two spaces (so the code is not so wide) and use the “codify text” formatting button which you will find above the text entry box in the forum (with a ‘<>’ symbol on it).

Regards.

niklasnorin wrote on Monday, May 13, 2013:

Thanks for helping me out, Richard. The code is deeply integrated in my application but the following is a cut down version, some application specifics included for context.

Setup

Stack size for both tasks is set to 2048. The total stack and heap are both set to a minimum of 32k in the linker script. The FreeRTOSConfig is mostly default, but I’ve changed some defines:

#define configTOTAL_HEAP_SIZE  ( ( size_t ) ( 32 * 1024 ) )
#define configCHECK_FOR_STACK_OVERFLOW  2
#define configUSE_MALLOC_FAILED_HOOK  1
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY  5

The CAN interrupt now has an interrupt priority of 6, numerically one higher than configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY.

app_task

#define test 0
void app_task( struct ProcessContext* processContext ) {
  struct Service logService;
  struct RovCanMessageMetaData messageMetaData;
  static union LogMessageType message;
  
  for(;;) {
#if test == 1
    /* This does not work */
    receiveMessage( &logService, &messageMetaData, &message );
#else
    /* This works, even though it is simply cut and pasted from inside receiveMessage */
    uint8_t tmpBuffer[ sizeof(struct RovCanMessageMetaData) + 8 ];
    queue_receive( serviceInternal->receiveQueue, tmpBuffer, WAIT_FOREVER);
    memcpy( &messageMetaData, tmpBuffer, sizeof(struct RovCanMessageMetaData) );
    memcpy( &message, tmpBuffer + sizeof(struct RovCanMessageMetaData), messageMetaData.messageSize );
#endif
  }
}

network_task

void network_task( struct ProcessContext* processContext ) {
  for(;;) {
    queue_receive( internal->canRxMessageQueue, &message, WAIT_FOREVER );
    /* A lot cut out for brevity, the end result is the following */
    /* Copy the meta-data to the buffer */
    memcpy( singleMessageBuffer,
            &messageMetaData,
            sizeof(struct RovCanMessageMetaData) );
    /* Copy the message to the buffer */
    memcpy( singleMessageBuffer + sizeof(struct RovCanMessageMetaData),
            message->payload,
            message->payloadSize );
    /* Send the buffer to the queue, simply wraps the RTOS call */
    queueStatus = queue_sendToBack( serviceInternal->receiveQueue,
                                    singleMessageBuffer,
                                    0);
  }
}

CAN interrupt

void CAN_PORT1_RX_INT_HANDLER(void) {
  portBASE_TYPE higherPrioTaskAwaken;
  /* A lot cut out for brevity */
  /* This is called two function calls deep inside the interrupt */
  queue_sendToBackFromISR( rovosNetworkInternal->canRxMessageQueue, (void*)message, higherPrioTaskAwaken );
  portEND_SWITCHING_ISR( higherPrioTaskAwaken );
}

All this works if the define “test” is set to 0: I can send any number of CAN messages and they are passed all the way up to the app, and I can repeat this as many times as I want. If I set “test” to 1, however, then it does not return from inside receiveMessage in app_task. When it tries to return from receiveMessage it goes to Hard Fault.

Another weird thing: if I set “test” to 0, which worked before, and then change the two structs and the union in app_task to be static, it goes to Hard Fault at

if( xTaskResumeAll() == pdFALSE ) {
  portYIELD_WITHIN_API();
}

inside xQueueGenericReceive. This means that after one message has been successfully received all the way up to the app, when the app tries to listen to the queue again it goes into a hard fault via portYIELD_WITHIN_API().

Overall it feels as if something is really wrong with the heap or stack. If you need any more information, please tell me and I will provide it.

Thanks again for taking the time to look into this!

rtel wrote on Monday, May 13, 2013:

Minor point not related to the purpose of this thread, but

portBASE_TYPE higherPrioTaskAwaken;

should be

portBASE_TYPE higherPrioTaskAwaken = pdFALSE; /* or simply 0 */

so it is initialised before it is used.  It might be that you initialise later explicitly in the code you have cut out for brevity.  Don’t bother answering this point, it’s not really relevant.
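The reason the initialisation matters can be shown with a small model. The FromISR queue functions set *pxHigherPriorityTaskWoken to pdTRUE when the call unblocks a higher-priority task and otherwise leave it untouched, so an uninitialised flag can request a context switch at random. The real API also takes the address of the flag; how the queue_sendToBackFromISR wrapper above forwards it is not shown. Everything in the sketch is a hypothetical stand-in, not FreeRTOS code.

```c
/* Hypothetical stand-ins for pdFALSE / pdTRUE. */
#define MODEL_FALSE 0
#define MODEL_TRUE  1

/* Models only the flag handling of a FromISR send: the flag is set when a
   task was woken and is never cleared, so the caller must start it at FALSE. */
void model_send_from_isr(int task_was_woken, int *woken_flag) {
    if (task_was_woken) {
        *woken_flag = MODEL_TRUE;
    }
}
```

Starting the flag at MODEL_FALSE means it reads TRUE afterwards only if some call in the handler actually woke a task.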


So far I can’t see anything that stands out. 

> tries to listen to the queue again it goes into a hard fault via portYIELD_WITHIN_API().

portYIELD_WITHIN_API() should be a benign macro. It just sets a bit in a core register that is always there, and always accessible to a privileged task. It also should not use any stack. If this is where the problem is manifesting itself, then I would guess either that something is going wrong prior to that in xTaskResumeAll() or, more likely, that portYIELD_WITHIN_API() is executing correctly and, unseen by you, the interrupt it triggers is taken and the crash is actually happening in the interrupt service routine.
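For context, on the Cortex-M ports the yield comes down to setting the PENDSVSET bit (bit 28) in the Interrupt Control and State Register at 0xE000ED04, which pends the PendSV exception that performs the context switch. The sketch below shows that single write, with the register passed in as a pointer so it can be exercised off target; the names are illustrative, not the port's own.

```c
#include <stdint.h>

/* Interrupt Control and State Register in the Cortex-M System Control Block. */
#define ICSR_ADDRESS   UINT32_C(0xE000ED04)
/* Writing 1 to bit 28 pends the PendSV exception. */
#define ICSR_PENDSVSET (UINT32_C(1) << 28)

/* Request a context switch by pending PendSV. No stack is used and nothing
   else happens here; the switch itself runs later, in the PendSV handler. */
void request_context_switch(volatile uint32_t *icsr) {
    *icsr = ICSR_PENDSVSET;
}
```

On target the argument would be (volatile uint32_t *)ICSR_ADDRESS.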

Please post the receiveMessage() function too as that seems to be the offending function call.

Regards.

niklasnorin wrote on Monday, May 13, 2013:

Well, the content of receiveMessage() is really the lines between the #else and #endif inside app_task, nothing else. And queue_receive(…) just calls xQueueReceive(…).

I’ll continue to search, but I have no idea what the problem might be.

niklasnorin wrote on Monday, May 13, 2013:

To be able to continue work in parallel with solving this, I introduced some more code and noticed something. “test” is set to 0, referring to the code from earlier.

I introduced another struct in app_task. When the task runs for the first time this struct is used to initialize a peripheral, and it works correctly. When app_task is resumed after data has arrived, this struct is completely garbled; I guess this would indicate some kind of stack failure. If I mark this struct as static, everything works as intended.

rtel wrote on Monday, May 13, 2013:

It is much more likely to be something writing over RAM that is being used as the task stack than the stack corrupting itself (once allocated, the stack is just used as any other stack with the compiler taking care of stack frames, etc.).

Here are a couple of debugging ideas.

1) Open up a memory window that shows the RAM you know is getting garbled. Once the structure has been used (before it is corrupted), if your hardware allows, set a data breakpoint that will cause the debugger to break when one of the corrupted RAM locations is next written to. If your hardware does not allow that, then simply leave the memory window open and step through the code until you notice the values in the memory window changing.

2) Try commenting out the memcpy/memset calls in your tasks. That is one likely source of the problem.

3) Double/triple check the function calls used to create the queues. Do you have the parameters the right way around? Is the size of the items being queued set correctly?

4) I think you said you were posting pointers to buffers into the queues. Are the pointers being managed correctly? Are you passing the address being pointed to into the queue rather than the address of the pointer variables?
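Points 3 and 4 can be illustrated with a small host-side model. A FreeRTOS queue copies exactly the item size given at creation, both on send and on receive, so the receive buffer must be at least that large, and a queue of pointers must be given the address of the pointer variable. ModelQueue, Message, and the functions below are hypothetical stand-ins, not the FreeRTOS implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_STORAGE_BYTES 64u

/* Hypothetical one-slot stand-in for a queue; only the copy semantics matter. */
typedef struct {
    size_t  item_size;                     /* fixed when the queue is created */
    uint8_t storage[MODEL_STORAGE_BYTES];
} ModelQueue;

/* Hypothetical message type used in the example. */
typedef struct {
    int id;
} Message;

/* Send by copy: item_size bytes are read from the address passed in.
   For a queue of pointers, 'item' must be the address of the pointer. */
int model_send(ModelQueue *q, const void *item) {
    if (q->item_size > MODEL_STORAGE_BYTES) {
        return 0;
    }
    memcpy(q->storage, item, q->item_size);
    return 1;
}

/* Receive by copy: a real queue writes item_size bytes to dest regardless of
   how big dest is. The extra dest_size argument lets this model refuse a
   too-small buffer instead of overwriting whatever lies behind it. */
int model_receive_checked(const ModelQueue *q, void *dest, size_t dest_size) {
    if (q->item_size > MODEL_STORAGE_BYTES || dest_size < q->item_size) {
        return 0;
    }
    memcpy(dest, q->storage, q->item_size);
    return 1;
}
```

In the posted code that would mean checking the item size passed when creating receiveQueue against sizeof(struct RovCanMessageMetaData) + 8, the size of tmpBuffer.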

Regards.