xQueueReceive failing with corruption

FreeRTOS V10.2.0
LPC1517 - Cortex M3

This is a continuation of this thread: https://forums.freertos.org/t/hardfault-with-corrupt-strange-msp/11004 but I’ve opened a new thread because I have created a completely new project in trying to track this down, and it is sufficiently different that I felt it warranted a new discussion.

In working on that problem I have created a completely minimalistic project that manifests the problem I have been seeing. In this project there is just a main thread, the idle thread, and a single queue. Again, everything is done statically. main() sets up the Q, the main thread, and a repetitive timer. The main thread waits for an item in the Q and toggles an LED. The timer interrupt simply puts an item (a uint32_t) into the Q. The net effect is that the LED toggles whenever something goes through the Q; that’s all the project does. Here is the code:

#define QUEUE_LENGTH_RX 		10
#define TASK_STACK_SIZE_MAIN	256

static StaticTask_t _task_main;
static StackType_t _task_stack_main[ TASK_STACK_SIZE_MAIN ];
static TaskHandle_t _task_hdl_main;
static void main_task( void * pvParameters );
StaticQueue_t _queue_hdr_rx;
static uint8_t _queue_buf_rx[ QUEUE_LENGTH_RX * sizeof( uint32_t ) ];
QueueHandle_t _queue_hdl_rx;

uint32_t frame;
uint32_t main_event;

#define DISABLE_WRITE_BUFFER
//******************************************************************************
int main(void)
//******************************************************************************
{
#ifdef DISABLE_WRITE_BUFFER
	SCnSCB->ACTLR |= SCnSCB_ACTLR_DISDEFWBUF_Msk;
#endif

	SystemCoreClockUpdate();

	Chip_GPIO_Init(LPC_GPIO);
	Chip_GPIO_SetPinDIROutput(LPC_GPIO, GREEN_PORT, GREEN_PIN);

	Chip_Clock_EnablePeriphClock(SYSCTL_CLOCK_RIT);
	Chip_RIT_Init(LPC_RITIMER);
	Chip_RIT_SetTimerIntervalHz(LPC_RITIMER, 6);
	NVIC_SetPriority(RITIMER_IRQn, configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY);
	Chip_RIT_Enable(LPC_RITIMER);
	NVIC_EnableIRQ(RITIMER_IRQn);

	_queue_hdl_rx = xQueueCreateStatic( QUEUE_LENGTH_RX,
								 sizeof( uint32_t ),
								 _queue_buf_rx,
								 &_queue_hdr_rx );
	vQueueAddToRegistry(_queue_hdl_rx, "rx");

	_task_hdl_main = xTaskCreateStatic(
						main_task ,
						"main_task",
						TASK_STACK_SIZE_MAIN,
						NULL,
						(tskIDLE_PRIORITY + 1UL),
						_task_stack_main,
						&_task_main);

	vTaskStartScheduler();
	return 0 ;
}

//******************************************************************************
static void main_task( void* ctx )
//******************************************************************************
{
	while(1){
		xQueueReceive( _queue_hdl_rx , &main_event , portMAX_DELAY );

		Chip_GPIO_SetPinToggle(LPC_GPIO, GREEN_PORT, GREEN_PIN);
	}
}

//******************************************************************************
void RIT_IRQHandler(void)
//******************************************************************************
{
	Chip_RIT_ClearIntStatus(LPC_RITIMER);

	xQueueSendFromISR( _queue_hdl_rx, &frame, NULL );
}

This code will run for no longer than 15 minutes; it would always hard fault (in memcpy) until I added some asserts at the beginning of xQueueReceive & prvCopyDataFromQueue. With the asserts in place, it instead hangs at a failed assertion in prvCopyDataFromQueue. All the assertions do is verify that the passed-in args actually point to the correct places (the globals main_event & _queue_hdl_rx).

In xQueueReceive:

BaseType_t xQueueReceive( QueueHandle_t xQueue, void * const pvBuffer, TickType_t xTicksToWait )
{
BaseType_t xEntryTimeSet = pdFALSE;
TimeOut_t xTimeOut;
Queue_t * const pxQueue = xQueue;

	configASSERT( 	pxQueue == (Queue_t*)&_queue_hdr_rx);
	configASSERT( 	pvBuffer == (void*)&main_event);

In prvCopyDataFromQueue:

static void prvCopyDataFromQueue( Queue_t * const pxQueue, void * const pvBuffer )
{
	configASSERT( 	pxQueue == (Queue_t*)&_queue_hdr_rx);
	configASSERT( 	pvBuffer == (void*)&main_event);

After running a while, the first assert in prvCopyDataFromQueue fails. Moving up to the stack frame for xQueueReceive and examining the values yields this:

p {&_queue_hdr_rx, &main_event}
$19 = {0x2000794 <_queue_hdr_rx>, 0x2000814 <main_event>}
p {&xQueue, &pvBuffer}
$20 = {0x200032c <uxIdleTaskStack.9178+372>, 0x2000328 <uxIdleTaskStack.9178+368>}
p &pxQueue
$21 = (Queue_t * const *) 0x2000348 <_task_main+16>
p {xQueue, pxQueue}
$22 = {0x875 <prvTaskExitError>, 0x2000338 <_task_main>}

It can be seen from this that the two arguments to xQueueReceive, xQueue & pvBuffer, are now pointing to locations in the idle task stack, NOT the main task stack, even though xQueueReceive is being called from the main task.

It is interesting to note that in this frame the SP is 0x2000740 and if we examine the memory there we see:

x/4w 0x2000740
0x2000740 <_task_stack_main+944>:	0xa5a5a5a5	0xffffffff	0x02000814	0x02000794

The last 3 words there are the arguments to xQueueReceive we would expect to see (portMAX_DELAY = 0xFFFFFFFF). It’s probably worth mentioning at this point that the MSP is corrupted as well, with the value 0x2000fe0; it should be somewhere just below 0x2003000, the end of RAM.

The RIT is the only interrupt active outside of FreeRTOS and its priority level is set correctly. I’m at a loss here. It appears that the wrong stack is being restored for the main task thread. Now I realize that 99% of the time it is user error, but I have no idea what could be wrong in this short little example.

Have I done something wrong in this example?
Why are these values so whacked out?
Why does it run for a while (sometimes fails at 5 minutes, sometimes 10, never gets past 15)?
How do I debug this further?

It’s great that you have managed to isolate the issue to such a small project. Are you able to zip up the entire project, including the IDE files, linker script, etc., and post it to the forum so I can take a look? Make sure to clean the project first so the binaries are not included. Also, please remind me how you are building (makefile, MCUXpresso, or something else).

The project is being built via MCUXpresso version 11.1.

I have also included a directory containing the memory dump, the disassembly, the register states for both frames, the map file, and the binary from the run/failure I used as the basis for my post.

It is running on custom hardware using an LPC1517, but it should run on anything in the LPC15XX family with a tweak to the memory linker script file.

Thank you for looking into this - I’ve been racking my brain for quite a while. Please let me know if there is any information that is missing.

queue_fail.tar.gz (1.1 MB)

It’s likely not the cause of the trouble you’re in, but it is better to swap the queue creation and the enabling of the interrupt whose ISR uses the queue, this way:

_queue_hdl_rx = xQueueCreateStatic( QUEUE_LENGTH_RX,... )
NVIC_EnableIRQ(RITIMER_IRQn);

I enable an interrupt last, e.g. in its already-started serving task, as part of the preamble before entering the task’s forever loop.

I’ve not seen anything in the code that doesn’t look right, but have a query regarding the memory map. This is in the linker script:

MEMORY
{
  /* Define each memory region */
  MFlash64 (rx) : ORIGIN = 0x0, LENGTH = 0x10000 /* 64K bytes (alias Flash) */  
  Ram0_4 (rwx) : ORIGIN = 0x2000000, LENGTH = 0x1000 /* 4K bytes (alias RAM) */  
  Ram1_4 (rwx) : ORIGIN = 0x2001000, LENGTH = 0x1000 /* 4K bytes (alias RAM2) */  
  Ram2_4 (rwx) : ORIGIN = 0x2002000, LENGTH = 0x1000 /* 4K bytes (alias RAM3) */  
}

and this is output by the linker:

Memory region         Used Size  Region Size  %age Used
        MFlash64:        9244 B        64 KB     14.11%
          Ram0_4:        2072 B         4 KB     50.59%
          Ram1_4:          0 GB         4 KB      0.00%
          Ram2_4:          0 GB         4 KB      0.00%

…however the data sheet for the LPC1517 says Ram0 and Ram1 are 16K bytes, not 4. I don’t know if that is related to the issue, but I wonder if there are other things that don’t appear right in the settings?

Can you try adding -fno-builtin to your compiler command line to see if that makes a difference? Thanks.

The 16 + 16 + 4 config is for the 1549/19,

the 1548/18 is 8 + 8 + 4,

and the 1547/17 is 4 + 4 + 4.

-fno-builtin was already specified in the failed example.

Yes, I agree, and in a real-world project I would do that. Here I was just trying to compartmentalize operations for ease of reading.

and the plot thickens…

Changing the main task and the RIT interrupt to incorporate the higher priority task woken feature…

//******************************************************************************
static void main_task( void* ctx )
//******************************************************************************
{
	Chip_RIT_ClearIntStatus(LPC_RITIMER);
	NVIC_EnableIRQ(RITIMER_IRQn);
	while(1){
		xQueueReceive( _queue_hdl_rx , &main_event , portMAX_DELAY );

		Chip_GPIO_SetPinToggle(LPC_GPIO, GREEN_PORT, GREEN_PIN);
	}
}

//******************************************************************************
void RIT_IRQHandler(void)
//******************************************************************************
{
	static BaseType_t higher = 0;
	higher = 0;
	Chip_RIT_ClearIntStatus(LPC_RITIMER);

	xQueueSendFromISR( _queue_hdl_rx, &frame, &higher );
	if (higher){
		portYIELD_FROM_ISR(higher);
	}
}

This causes the first assertion in prvCopyDataFromQueue to fail immediately, on the first pass through the loop. The arguments show the same result insofar as where they point: the idle stack.

Also interesting: if the BaseType_t variable ‘higher’ is a local instead of static, there is an immediate hard fault that is different from the main problem (???)
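For reference, the textbook FromISR pattern keeps the “woken” flag as a local initialised to pdFALSE and passes it straight to portYIELD_FROM_ISR (which does nothing when the flag is pdFALSE). A minimal sketch using the same names as above; it is not claimed to avoid the fault by itself:

//******************************************************************************
void RIT_IRQHandler(void)
//******************************************************************************
{
	BaseType_t xHigherPriorityTaskWoken = pdFALSE;

	Chip_RIT_ClearIntStatus(LPC_RITIMER);

	/* The flag is set to pdTRUE if sending to the queue unblocked a task of
	   higher priority than the one that was interrupted. */
	xQueueSendFromISR( _queue_hdl_rx, &frame, &xHigherPriorityTaskWoken );

	/* Request a context switch on exit only if needed; the macro itself
	   handles the pdFALSE case. */
	portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}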

Clutching at straws because it should be fine, but a couple of other things to try:

  1. Increasing configMINIMAL_STACK_SIZE.
  2. Removing -fno-builtin (as it was already defined!)

A couple of other points I’m wondering, but can’t find in the project:

Is the stack used by main large enough? The stack used by main is reused as the stack used by interrupts after the scheduler has started. If it were too small then it could cause the interrupt stack to overflow. This is doubtful because linker scripts normally don’t assume multi-threading, so by default they allocate a large stack to main().

Are you allocating any memory to the C library heap? As you have configSUPPORT_DYNAMIC_ALLOCATION set to 0 there is no need to allocate any memory to the heap in the linker script unless your application code calls malloc. Is it possible heaps and stacks are clashing? Again, unlikely as the heap does not appear to be used anywhere.

As you know, making higher static is broken in the common case. But to me it’s also a symptom that there is something wrong with the main/ISR stack configuration.
There should be some (main) stack-related symbols in the linker script that reserve a sufficiently large area for it. Often the area just before it defines the libc heap and/or the end of the data sections; these should not overlap with the main stack.
Can you e.g. fill the (currently remaining) main stack with a marker pattern (e.g. 0xa5a5) in the debugger right before starting the scheduler, then break into the ISR and check the main stack area (growing downwards from MSP)?
The global linker script stack area symbols should be visible in the debugger as well.

Actually it seems MSP is not corrupted here after all. According to the map file, _vStackTop is 0x2001000, so the MSP is initialized to 0x2001000, not 0x2003000. That still leaves the mystery of what’s happening in this demo application, but I don’t think MSP is getting corrupted.

However, if MSP is initialized to 0x2001000 in your original application (from your original post), then maybe there’s not enough space available to the ISR stack. As a point of reference, in this demo app in this thread, there’s nearly 2KB available space for the ISR stack. In your original application, presumably there’s a lot less.

For what it’s worth, your linker file indicates that you can define __user_stack_top to override the default value of 0x2001000.

Yes, I realized this late last night. In my main application the three 4k RAM sections are combined into one section and the MSP would start at 0x2003000. In this example project the ‘project wizard’ only uses the first 4k section, so the 0x2000fe0 would be correct. I still had the 0x2003000 in the back of my mind.

There is no heap in my main application or in the example project. In the case of the main app, there is ~3k of main stack space.

I already have this set up in my main app: I have markers on all of my task stacks as well as the main stack. I have checked all of these in all of the hangs/faults, and the stacks have always looked correct insofar as there is ample remaining free stack space. In the case of the task stacks it is a simple thing to do; in the case of the main stack I have a PROVIDE marker at the end of .bss in the linker script and use that to fill memory up to the stack pointer used in the vector table. Given that I have no malloc, all of that is stack space.
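Roughly along these lines (a sketch rather than the exact code from my app; `_end_bss` stands in for the real PROVIDE()d symbol name, __get_MSP() is the CMSIS intrinsic, and it paints only up to the current MSP so the part of the main stack already in use is left alone):

extern uint32_t _end_bss;	/* illustrative name for the PROVIDE()d end-of-.bss symbol */

static void paint_main_stack( void )
{
	/* Call early in main(), while still running on the MSP.  Everything from
	   the end of .bss up to the current stack pointer is unused main/ISR
	   stack, so it is safe to fill it with a marker pattern and inspect the
	   high-water mark later in the debugger. */
	uint32_t *p   = &_end_bss;
	uint32_t *top = (uint32_t *) __get_MSP();	/* current main stack pointer */

	while( p < top )
	{
		*p++ = 0xA5A5A5A5;
	}
}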

I have now reproduced this on a NXP dev board for the LPC1549: https://www.nxp.com/products/processors-and-microcontrollers/arm-microcontrollers/general-purpose-mcus/lpc1500-cortex-m3/lpcxpresso-board-for-lpc1549:OM13056

All settings as created by the MCUXpresso new project wizard.

I noted in the datasheet that at least one of the RAM blocks has to be explicitly enabled in the code - can you confirm that is being done?

The bootloader, which runs first on all LPC parts, enables the first block. The remaining 2 blocks are turned on by a default reset of the system control block.

So after a day of banging on this more…

The problem does NOT manifest on a Cortex-M0 (LPC824), which is not too surprising given the difference in interrupt handling.

Changing the little test case to replace the timer interrupt feeding the Q with a second task that feeds the Q and then delays 200 ms also does NOT manifest the problem.

It is interesting, though, that if the tick rate is changed from 1000 to 100 (using the problematic test case), the runtime increases to 1 hour before failure. That implies to me that there is some type of nested interrupt problem: SysTick fires on top of the queue-feeding interrupt and causes failure in some manner.

Looking closer at the values when the assert fires and going back to the xQueueReceive frame, both R7 & LR are wrong. That makes sense, since the args passed into prvCopyDataFromQueue are referenced off of R7.

It appears in the disassembly that almost all functions push R7 & LR on entry and pop them on exit. Is this a requirement of the ABI? It seems that in the case of a nested interrupt those two are not getting restored from the correct stack for that thread. Does that make sense?

How do I test that theory? What is the call stack in that scenario?

I think I found the problem. In your FreeRTOSConfig.h, configLIBRARY_LOWEST_INTERRUPT_PRIORITY is set to 8. It should be 7. It should always be (1 << configPRIO_BITS) - 1.

Using 8 is causing the PendSV interrupt to have priority 0, the highest priority in the system. To work correctly, PendSV must be configured for the lowest priority in the system.
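For reference, the relevant FreeRTOSConfig.h lines for a 3-priority-bit part like the LPC15xx should end up looking roughly like this (a sketch of the standard Cortex-M3 settings, not a copy of the project’s actual file; the max-syscall value of 5 is just the common demo default):

#define configPRIO_BITS								3	/* LPC15xx implements 3 NVIC priority bits */

/* Lowest priority = all implemented bits set: (1 << configPRIO_BITS) - 1 = 7, not 8. */
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY			7
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY	5

/* Shifted into the top bits of the 8-bit priority registers.  With the value 8
   the shift gives 0x100, which truncates to 0 in the register and hands PendSV
   the highest priority in the system. */
#define configKERNEL_INTERRUPT_PRIORITY			( configLIBRARY_LOWEST_INTERRUPT_PRIORITY << ( 8 - configPRIO_BITS ) )
#define configMAX_SYSCALL_INTERRUPT_PRIORITY	( configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << ( 8 - configPRIO_BITS ) )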
