vTaskDelay() hard fault on priority higher than idle task

maharvey · April 12, 2023, 3:48pm

I don't believe the ISR stack is overflowing: I inspected its contents and about 90% of it is still 0xee, which is the portISR_STACK_FILL_BYTE value.
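For anyone who wants to do the same check in code rather than in the debugger, here is a minimal host-side sketch. It assumes the stack grows downward, so the untouched fill bytes sit at the low-address end of the buffer. The function name and the local ISR_STACK_FILL_BYTE macro are made up for illustration; only the 0xee value comes from the port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as portISR_STACK_FILL_BYTE (0xee). */
#define ISR_STACK_FILL_BYTE 0xeeU

/* Hypothetical helper: counts how many bytes at the low end of the stack
 * buffer still hold the fill pattern, i.e. were never written to.
 * Assumes a downward-growing stack. */
static size_t isr_stack_unused_bytes(const uint8_t *stack_base, size_t size_bytes)
{
    size_t unused = 0;

    assert(stack_base != NULL);

    while (unused < size_bytes && stack_base[unused] == ISR_STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

A result close to zero would point to an overflow (or a near miss); a large result, as in my case, suggests the ISR stack itself is fine.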

Nevertheless I increased the ISR stack size to 4096 words and saw no difference.

Looking at the assembly in ISR_Support.h I see that the stack pointer is swapped to the ISR stack before saving the FPU context. So I believe it is normal for the stack pointer register to be within the ISR stack range when the data is overwritten.

/* Swap to the system stack. */
	la			sp, xISRStackTop
	lw			sp, (sp)
[...]

/* Save the FPU context if the nesting count was zero. */
	#if ( __mips_hard_float == 1 ) && ( configUSE_TASK_FPU_SUPPORT == 1 )
		la			s6, uxInterruptNesting
		lw			s6, 0(s6)
		addiu		s6, s6, -1
		bne			s6, zero, 1f
		nop

		/* Test if the current task needs the FPU context saving. */
		lw			s6, portTASK_HAS_FPU_STACK_LOCATION(s5)
		beq			s6, zero, 1f
		nop

		/* Save the FPU registers. */
		portSAVE_FPU_REGS ( portCONTEXT_SIZE + 8 ), s5

		/* Save the FPU status register */
		cfc1		s6, $f31
		sw			s6, (portCONTEXT_SIZE + portFPCSR_STACK_LOCATION)(s5)

		1:
	#endif

aggarg · April 12, 2023, 3:57pm

The question is why is any of your data lying in that ISR stack range which is getting overwritten? Where are you declaring that data? Can you share the code snippet?

maharvey · April 12, 2023, 4:03pm

I posted my code a bit earlier in this thread. The corrupted data structure is “xfer”, a local variable declared in the task function. By design, the function is guaranteed not to return until the ISR has finished transferring the data.

RAc · April 12, 2023, 5:46pm

Another shot in the dark: Are you allocating automatic variables in main() that your code tries to use after the scheduler has started? I am sure you know that that is a no-go…

maharvey · April 12, 2023, 6:08pm

The code doesn’t do that. All variables accessed by the main function are globals.

aggarg · April 18, 2023, 4:19am

We had a call and it turned out that the variable that was being overwritten by the context save code was already out of scope.

maharvey · April 24, 2023, 3:42pm

In case someone finds this thread in the future: @aggarg helped me find a race condition between ctx->rx_xfer going out of scope and the last RX interrupt. It could only happen when bytes were transmitted but not read back. In that case an extra RX interrupt would fire after the done flag was set, and the task was relying on that flag to know when it could free the memory. To further complicate things, the race only showed up when the priority of the task was elevated.

Changing the logic of the done flag to count the number of bytes received even if bytes are not read back fixed the issue.
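A rough sketch of the idea, not my actual driver code: the type, field, and function names below are hypothetical, and the hardware access is left out so it can be compiled on a host. The point is that the done flag is driven by the count of bytes received, whether or not the caller supplied a buffer to read them back into.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transfer descriptor; names are illustrative only. */
typedef struct {
    uint8_t *rx_buf;            /* NULL when received bytes are not read back */
    size_t len;                 /* total number of bytes in the transfer */
    volatile size_t rx_count;   /* bytes received so far, stored or discarded */
    volatile bool done;         /* set only after the last byte has arrived */
} xfer_t;

/* Would be called from the RX interrupt for every byte received. */
static void xfer_on_rx_byte(xfer_t *x, uint8_t byte)
{
    assert(x != NULL);

    /* Defensive guard (an addition for this sketch): ignore anything
     * that arrives after the transfer has completed. */
    if (x->rx_count >= x->len) {
        return;
    }

    /* Store the byte only if the caller wants it read back. */
    if (x->rx_buf != NULL) {
        x->rx_buf[x->rx_count] = byte;
    }

    /* Count the byte either way, so "done" means the last RX
     * interrupt has really been serviced. */
    x->rx_count++;
    if (x->rx_count == x->len) {
        x->done = true;
    }
}
```

With this, the task that owns the descriptor only sees done set after the final RX interrupt has run, so nothing touches the descriptor once the task lets it go out of scope.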

Case closed.