Cortex-A9 port causes FreeRTOS_Undefined exception

supergaute wrote on Friday, November 23, 2018:

I’m using FreeRTOS 10, with Xilinx’s Zynq 7000 Chip. This is running Linux on core 0 and FreeRTOS on core 1.

When I apply load to core 1, FreeRTOS eventually crashes into the FreeRTOS_Undefined exception handler.
This is the stack trace:

MyFreeRTOSApp	
	Thread #1 57005 (Suspended : Breakpoint)	
		FreeRTOS_Undefined() at port_asm_vectors.S:96 0x30000040	
		0x10101018	

R14_und is 0x10101018, which looks like the initial stack contents for R10 as set in port.c.

How can I figure out what causes this exception?

rtel wrote on Friday, November 23, 2018:

Do you know which instruction generated the fault (that is, the value of
the program counter at the time the fault occurred)? It should be
obtainable from within the exception handler. I have an example of how to
obtain the offending PC for Cortex-M code, but can’t recall how to do it
for Cortex-A.

Other than that - have you looked through the list of usual suspects?
It is here: https://www.freertos.org/FAQHelp.html
Pay particular attention to the interrupt priority requirements.

supergaute wrote on Friday, November 23, 2018:

Yes, the PC was 0x10101014 at the time the fault occurred. This is way outside the valid program space which is 0x30000000 - 0x3800000.

I’ve read through the general FAQ and the Cortex-A specific article.

0x10101014 is 1 word above 0x10101010 which is initialized in port.c.
If I change 0x10101010 to, for instance, 0x12301010 in port.c, this is reflected in the crash, where the PC is now 0x12301014.
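
For reference, the relation between the R14_und value and the faulting PC reported in this thread can be sketched in C. On ARMv7-A, an Undefined Instruction exception leaves LR_und pointing at the instruction after the faulting one, so the offset is 4 bytes in ARM state and 2 bytes in Thumb state (SPSR bit 5, the T bit). The function name is hypothetical, and reading LR_und and SPSR_und inside the handler is left out.

```c
#include <stdint.h>

/* T bit of the saved program status register: set if the fault
 * happened while executing Thumb code. */
#define SPSR_T_BIT (1u << 5)

/* Hypothetical helper: address of the instruction that raised the
 * Undefined Instruction exception, derived from LR_und and SPSR_und. */
static uint32_t undef_fault_address(uint32_t lr_und, uint32_t spsr_und)
{
    uint32_t offset = (spsr_und & SPSR_T_BIT) ? 2u : 4u;
    return lr_und - offset;
}
```

With the R14_und value from the first post and the T bit clear, undef_fault_address(0x10101018, 0) gives 0x10101014, which agrees with the PC quoted here.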

rtel wrote on Friday, November 23, 2018:

Right - agree that value almost certainly must have come from an
initialised register value - looks like it has been used to hold a byte,
hence the rest of the register is untouched. This could be a stack
issue then, where returning from a function or interrupt, etc. has
resulted in the wrong value being popped into the PC (by which I really
mean, the address used to pop the PC was wrong as the stack pointer was
wrong or stack corrupted).
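
One way to check this kind of suspicion mechanically is to compare a bogus PC against the register fill values. The sketch below assumes port.c initialises Rn with the register number written as two hex digits and repeated in every byte (0x10101010 for R10, as seen in this thread). That pattern and both function names are assumptions, so check them against your own port.c.

```c
#include <stdint.h>

/* Assumed fill value for register n (1..12): the decimal register number
 * written as two hex digits and repeated in all four bytes,
 * e.g. R9 -> 0x09090909, R10 -> 0x10101010. */
static uint32_t assumed_fill_value(unsigned reg)
{
    uint32_t byte = ((reg / 10u) << 4) | (reg % 10u);
    return byte * 0x01010101u;
}

/* Returns the register number whose assumed fill value lies at or just
 * below addr (within max_offset bytes), or -1 if none matches. */
static int match_fill_register(uint32_t addr, uint32_t max_offset)
{
    for (unsigned reg = 1; reg <= 12; reg++) {
        uint32_t fill = assumed_fill_value(reg);
        if (addr >= fill && addr - fill <= max_offset) {
            return (int)reg;
        }
    }
    return -1;
}
```

Under that assumption, match_fill_register(0x10101014, 16) returns 10, which fits the R10 value, while an address inside the real program space such as 0x30000040 returns -1.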

I think the first thing to do is check which task was running at the
time, assuming it was a task, not an interrupt. You can do that by
adding “(tskTCB*)pxCurrentTCB” to the expressions window in the debugger
- that should then decode pxCurrentTCB as a task control block structure
that can be expanded to see the task’s name as a string. Alternatively,
if you store the handles of the tasks you create, the value of
pxCurrentTCB will equal the task’s handle.

supergaute wrote on Friday, November 23, 2018:

That was the idle task running.
Does that mean it has to be an interrupt?

rtel wrote on Friday, November 23, 2018:

I wouldn’t say it ‘has’ to be an interrupt, but would agree it is very
likely to be an interrupt. Unless your application has added any
functionality to the idle task through an idle task hook function or a
trace macro?

supergaute wrote on Friday, November 23, 2018:

No functionality has been added to the idle task.
I’ve only installed one interrupt handler on top of what the FreeRTOS port does (the tick handler). The interrupt handler is related to OpenAMP, used to communicate to the other core.
I’m not sure how to proceed with the debugging now, maybe you can give some tips?

supergaute wrote on Sunday, November 25, 2018:

After some more investigation I found out the stack trace changes when I disable optimizations.

The stack trace now becomes this:

Thread #1 57005 (Suspended : Signal : SIGTRAP:Trace/breakpoint trap)	
	FreeRTOS_Undefined() at port_asm_vectors.S:96 0x30000040	
	ucHeap() at 0x31400994	

I also found that the line causing the problem is in tasks.c.
The macro traceTASK_CREATE( ) in the function prvAddNewTaskToReadyList() is defined by Tracealyzer, and it contains calls to portSET_INTERRUPT_MASK_FROM_ISR() and portCLEAR_INTERRUPT_MASK_FROM_ISR().

If I remove the portSET_INTERRUPT_MASK_FROM_ISR() and portCLEAR_INTERRUPT_MASK_FROM_ISR() calls, everything works ok.

If I don’t use Tracealyzer, I can mimic the same behaviour by adding

uint32_t irq_status = portSET_INTERRUPT_MASK_FROM_ISR();
portCLEAR_INTERRUPT_MASK_FROM_ISR(irq_status);

or

portDISABLE_INTERRUPTS();
portENABLE_INTERRUPTS();

after the traceTASK_CREATE() call.
It will execute around 1000-2000 times before it crashes.

What can be the issue here?

rtel wrote on Sunday, November 25, 2018:

Are you saying the problem is in the trace macro? So if you remove the
trace macros altogether (by not defining them, which makes them take
their default empty implementation) everything runs ok?

supergaute wrote on Sunday, November 25, 2018:

Yes, that is correct.

Would you consider it to be an issue if the trace macro calls portSET_INTERRUPT_MASK_FROM_ISR() and portCLEAR_INTERRUPT_MASK_FROM_ISR()?

richarddamon wrote on Sunday, November 25, 2018:

One big thing I see here is that …FROM_ISR stuff is supposed to be called from inside an ISR, while the traceTASK_CREATE isn’t going to be called from an ISR, but from a task context, so the definition in that macro sounds incorrect.

supergaute wrote on Monday, November 26, 2018:

If I disable Tracealyzer completely, and instead insert

		portENTER_CRITICAL();
		portEXIT_CRITICAL();

right after the traceTASK_SWITCHED_IN() call,
this results in the same behaviour: the system crashes. It will execute a few thousand times before it crashes.

Should the system behave ok when doing this, or is it expected to crash?

rtel wrote on Monday, November 26, 2018:

I would expect that to crash. traceTASK_SWITCHED_IN() is executed
inside an interrupt - and those macros are not interrupt safe. There
are two reasons I would not expect that to work properly: First exiting
the critical section could result in interrupts becoming enabled in a
part of the code where they should be disabled, and second those macros
are using a critical nesting count that is part of a task’s context -
each task has its own nesting count so using it in the interrupt (which
is not a task) doesn’t make sense - especially if you switch tasks
before exiting the critical section.

In this case it sounds like there could be an issue in the
implementation of the trace macro, which is provided by Percepio.
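
The first of the two reasons given above can be shown with a toy model in C. It is not the port's code: the globals and function names are hypothetical stand-ins for the interrupt mask and the critical nesting count, and the model assumes the exit path unmasks interrupts whenever the count returns to zero.

```c
#include <stdbool.h>
#include <stdint.h>

static bool     irq_masked;       /* stand-in for the interrupt mask state */
static uint32_t critical_nesting; /* stand-in for the per-task nesting count */

/* Task-level pair: counts nesting and unmasks when the count reaches zero. */
static void model_enter_critical(void)
{
    irq_masked = true;
    critical_nesting++;
}

static void model_exit_critical(void)
{
    if (critical_nesting > 0u) {
        critical_nesting--;
        if (critical_nesting == 0u) {
            irq_masked = false; /* unmasks regardless of the state on entry */
        }
    }
}

/* ISR-style pair: remembers the previous mask state and puts it back. */
static bool model_set_mask_from_isr(void)
{
    bool was_masked = irq_masked;
    irq_masked = true;
    return was_masked;
}

static void model_clear_mask_from_isr(bool was_masked)
{
    irq_masked = was_masked;
}
```

Starting with irq_masked true and critical_nesting 0 (interrupts masked, running task not in a critical section), the enter/exit pair leaves irq_masked false, while the set/clear pair leaves it true. The model says nothing about the second reason, the nesting count belonging to a task's context.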

supergaute wrote on Monday, November 26, 2018:

Would you consider it to be an issue if the trace macro calls portSET_INTERRUPT_MASK_FROM_ISR() and portCLEAR_INTERRUPT_MASK_FROM_ISR()?

Because disabling those lines in the trace implementation fixes the problem.

I just need to figure out if the problem is on my part or Percepio’s.

rtel wrote on Monday, November 26, 2018:

Not sure - they will enable global interrupts, but leave the interrupt
mask in the correct state, and the kernel only really uses the mask. In
that particular place (inside the context switch) they could well be
necessary.

supergaute wrote on Tuesday, November 27, 2018:

Who can answer this?

I’ve replicated the issue on a Xilinx dev board with a “Xilinx lwIP TCP perf” example application, and the latest version of Tracealyzer, so I’m pretty sure my setup is not part of the problem.

rtel wrote on Wednesday, November 28, 2018:

If I recall correctly our last exchange on this was suggesting an issue
in the implementation of the trace macros, which are provided by
Percepio - so you could ask Percepio - they generally have responsive support.

ldb wrote on Tuesday, December 04, 2018:

Not sure if it is related, but I had something similar on the Raspberry Pi with preemptive tasking, for a very interesting reason.

The trace functions are C code, and when you call C code under the ARM ABI the stack has to be 8-byte aligned, even though the normal alignment for a local variable push etc. is only 4. This means you can randomly come into the interrupt with the stack in a 4-byte-aligned position. So check the stack alignment restrictions on your system, as this seems to be common with ARM.

I had to make sure I aligned the stack before calling out to C code … failing to do so would randomly crash some time later. So my IRQ handler ended up looking like this

	/* Save the current context */
	portSAVE_CONTEXT

	/* the stack pointer is 4-byte aligned at all times, but it must be 8-byte aligned	*/
	/* to call external C code	*/
    mov r1, sp
    and r1, r1, #0x7									;@ Ensure 8-byte stack alignment
    sub sp, sp, r1										;@ adjust stack as necessary
    push {r1, lr}										;@ Store adjustment and LR_svc

	bl irqHandler										;@ Call irqhandler

	/* Reverse out 8 byte padding from above */
    pop {r1, lr}										;@ Restore LR_svc
    add sp, sp, r1										;@ Un-adjust stack

	/* restore context which includes a return from interrupt */
	portRESTORE_CONTEXT
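
The alignment arithmetic in that handler can be mirrored in C to check it by hand. This is only a sketch of the address calculation, with hypothetical type and function names, and it assumes the incoming stack pointer is at least 4-byte aligned.

```c
#include <stdint.h>

/* Result of aligning a stack pointer: the new value plus the number of
 * bytes that were skipped, which must be added back afterwards. */
typedef struct {
    uint32_t sp;
    uint32_t adjustment;
} aligned_sp_t;

/* Mirrors "and r1, r1, #0x7" followed by "sub sp, sp, r1":
 * round the stack pointer down to the next 8-byte boundary. */
static aligned_sp_t align_stack_down_8(uint32_t sp)
{
    aligned_sp_t result;
    result.adjustment = sp & 0x7u;
    result.sp = sp - result.adjustment;
    return result;
}

/* Mirrors "add sp, sp, r1": undo the adjustment after the C call returns. */
static uint32_t restore_stack(aligned_sp_t aligned)
{
    return aligned.sp + aligned.adjustment;
}
```

For a stack pointer of 0x1004 the adjustment is 4 and the aligned value is 0x1000. Pushing the two saved words (8 bytes) after that keeps the pointer 8-byte aligned for the call.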

supergaute wrote on Wednesday, December 12, 2018:

Was this on the Cortex-A7 or the Cortex-A53 version of the Raspberry Pi?
Were you using this with FreeRTOS?