Tracking down the cause of an unaligned memory access exception

jkn · August 1, 2024, 12:45pm

Hi All
I’m looking for general thoughts/advice on trying to track down an unaligned memory access exception.

We are occasionally seeing this on an embedded system with only a few tasks, but nested interrupts and potentially many interrupts occurring. From the debugger, it looks like a misaligned address is being pushed to the stack within the dispatcher, and a later attempt to pop that is causing the exception.

We have watermarked the stack and (recently) enabled stack checking via configCHECK_FOR_STACK_OVERFLOW and added a hook for vApplicationStackOverflowHook(). So far that is not showing any problems.

FWIW we have FreeRTOS v10.4.6, but have our own dedicated ISR stack (we might revert that latter part as part of the debug).

We do not have any dynamic memory allocation and use static resource creation.

This is using an SoC CPU which is not part of the ‘blessed’ FreeRTOS ported ecosystem, which I realise doesn’t help. However we have not had issues previously, he said unhelpfully.

This is not something I have had to debug before; what kind of thing might we need to be looking for?

Thanks & Regards, Jon N

jkn · August 1, 2024, 1:24pm

Actually, I realise I am able to add some more information regarding the port. We are using the (Synopsys) ARC_EM_HS port of FreeRTOS (found in /portable/ThirdParty/GCC/ARC_EM_HS/)

J

karahulx · August 1, 2024, 2:46pm

Hi @jkn
Some pointers you can check, which I suspect you will have looked at already:

Do you have configASSERT() defined?
Do you have stack overflow checking set to 2?
Can you use the latest version of the FreeRTOS Kernel, v11.1.0? The more recent the version, the more assert() points there are to catch interrupt priority problems.

Links can be found on this page: FreeRTOS - Open Source RTOS Kernel for small embedded systems

Then, are you sure your interrupt stack is large enough? Interrupts use the same stack as main(), so that stack is set up by your C run-time startup code, often via the linker script (depending on the tools). Overflows in the stack used by interrupts will not be caught by the stack overflow detection.
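Since the kernel's overflow detection does not cover the stack used by interrupts, one option is to watermark that stack by hand and read back how much of it was never touched. The following is only a minimal sketch of that idea: the names `isr_stack`, `ISR_STACK_WORDS` and `ISR_STACK_FILL` are hypothetical, and a descending stack is assumed. On a real target the region would be whatever the startup code or linker script assigns to interrupts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ISR_STACK_WORDS    256u           /* hypothetical size, in 32-bit words */
#define ISR_STACK_FILL     0xA5A5A5A5u    /* hypothetical fill pattern */

/* Hypothetical dedicated ISR stack. */
static uint32_t isr_stack[ ISR_STACK_WORDS ];

/* Fill the whole stack with the pattern. Call once, before interrupts
 * are enabled and before the stack is in use. */
void isr_stack_watermark( void )
{
    for( size_t i = 0; i < ISR_STACK_WORDS; i++ )
    {
        isr_stack[ i ] = ISR_STACK_FILL;
    }
}

/* Count the words that still hold the pattern, scanning up from the
 * lowest address. Assumes a descending stack, so index 0 is the last
 * word that would ever be used. A result of 0 means the stack has been
 * used to its limit and may have overflowed. */
size_t isr_stack_unused_words( void )
{
    size_t n = 0;

    while( ( n < ISR_STACK_WORDS ) && ( isr_stack[ n ] == ISR_STACK_FILL ) )
    {
        n++;
    }

    return n;
}
```

Polling `isr_stack_unused_words()` from a low-priority task or from the debugger gives a high-water mark for the interrupt stack in the same spirit as the per-task watermark.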

jkn · August 1, 2024, 4:53pm

Hi Rahul
thanks for the suggestions:

We don’t have configASSERT() defined - I will get that change made
we have (recently) set the stack checking to 2 and are looking to see if that shows anything. I am guessing that slows things down a little?
I take your point about trying with kernel v11.1.0. I will have to see how straightforward that is for us right now

Regards, Jon N
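For reference, the two settings discussed above might look roughly like this in FreeRTOSConfig.h. This is a sketch rather than the poster's configuration: `vAssertCalled`, `g_assert_file` and `g_assert_line` are user-supplied names (not part of the kernel), and the handler here simply records the location and returns so that the fragment can be exercised on a host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the most recent failed assertion. */
static const char * volatile g_assert_file = NULL;
static volatile int g_assert_line = 0;

/* User-supplied handler; the name is a common convention only. */
void vAssertCalled( const char * file, int line )
{
    g_assert_file = file;
    g_assert_line = line;

    /* On target: disable interrupts and spin here so the debugger can
     * inspect the call stack. Left out so this sketch returns on a host. */
}

/* FreeRTOSConfig.h fragment */
#define configCHECK_FOR_STACK_OVERFLOW    2
#define configASSERT( x ) \
    do { if( ( x ) == 0 ) { vAssertCalled( __FILE__, __LINE__ ); } } while( 0 )
```

Recording the file and line before halting means the failure location can be read from memory even if the call stack itself is damaged.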

aggarg · August 1, 2024, 4:56pm

Can you provide more information about this? Which address is this? Is this a return address that is pushed to the stack and later an instruction (e.g. pop {pc}) tries to jump to the misaligned address? If yes, does the address look legitimate i.e. can you look at the code around this address?

jkn · August 1, 2024, 6:44pm

Hi Gurav
yes, that seems to be broadly the situation. I had thought of the approach you suggest - look around the ‘bad’ address - but was unsure whether that was likely to be worthwhile. I will see if we can learn anything from that approach.

Thanks, Jon N

aggarg · August 2, 2024, 5:13am

If the code looks correct around the address and we can find the function call, then it is likely an alignment issue. We would look for things like whether an alignment is being specified in the code, or whether the code is written in assembly with no alignment specified. On the other hand, if the code around the address does not make sense, then it is likely stack corruption.
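The triage described here can be written down as a small check on the saved address. This is a sketch under assumptions: `classify_return_address` and the enum are hypothetical names, the code-section bounds would come from the linker map, and the instruction alignment is passed in because it depends on the CPU.

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
    ADDR_PLAUSIBLE,     /* inside the code section and correctly aligned */
    ADDR_MISALIGNED,    /* inside the code section: inspect the disassembly nearby */
    ADDR_OUTSIDE_CODE   /* not a code address at all: suspect stack corruption */
} addr_class_t;

/* Classify a saved return address against the code section
 * [text_start, text_end) and the required instruction alignment. */
addr_class_t classify_return_address( uintptr_t addr,
                                      uintptr_t text_start,
                                      uintptr_t text_end,
                                      uintptr_t insn_align )
{
    /* The mask test below only works for power-of-two alignments. */
    assert( ( insn_align != 0u ) && ( ( insn_align & ( insn_align - 1u ) ) == 0u ) );

    if( ( addr < text_start ) || ( addr >= text_end ) )
    {
        return ADDR_OUTSIDE_CODE;
    }

    if( ( addr & ( insn_align - 1u ) ) != 0u )
    {
        return ADDR_MISALIGNED;
    }

    return ADDR_PLAUSIBLE;
}
```

An address that lands inside the code section but off alignment is worth disassembling around; one that lands outside it points more towards the stack contents having been overwritten.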

jkn · August 6, 2024, 11:40am

I can give a bit of an update on this:

we are (currently) not getting the unaligned memory exception; this after rearranging the relative priority of some tasks. Time will tell whether it returns…
We are still seeing a (related) problem, which I didn’t describe earlier. We get into a situation where there appear to be no tasks running (including IDLE). I thought that it was the misalignment which was causing this, but perhaps not.

We did a bit of tracking down and found that we are getting stuck inside xTaskIncrementTick(). Specifically we can trigger a configASSERT at this point in the FreeRTOS code:


/* tasks.c, ~line 2790, in xTaskIncrementTick() */

else
{
    /* The delayed list is not empty, get the value of the
     * item at the head of the delayed list.  This is the time
     * at which the task at the head of the delayed list must
     * be removed from the Blocked state. */
    pxTCB = listGET_OWNER_OF_HEAD_ENTRY( pxDelayedTaskList ); /*lint !e9079 void * is used as this macro is used with timers and co-routines too.  Alignment is known to be fine as the type of the pointer stored and retrieved is the same. */

    configASSERT( pxTCB ); /* <-- this assertion fires */
    xItemValue = listGET_LIST_ITEM_VALUE( &( pxTCB->xStateListItem ) );

I presume ‘delayed task list’ holds the list of Blocked tasks? Maybe there is some historical reason for the name.

Anyway, something is clearly going awry here. Any thoughts?

Thanks, jon N
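The assertion sits in the branch taken when the delayed list is reported as non-empty, so a NULL owner at the head suggests the list's item count and its links no longer agree. One way to catch that earlier is a consistency walk over the list. The sketch below uses a deliberately simplified model (`list_t`, `item_t` and the function names are hypothetical, not the kernel's List_t) just to show the kind of checks involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified circular doubly linked list with a sentinel. */
typedef struct item
{
    struct item * next;
    struct item * prev;
    void * owner;           /* the sentinel has no owner */
} item_t;

typedef struct
{
    size_t count;
    item_t end;             /* sentinel; end.next is the head entry */
} list_t;

void list_init( list_t * l )
{
    l->count = 0;
    l->end.next = &l->end;
    l->end.prev = &l->end;
    l->end.owner = NULL;
}

void list_push_back( list_t * l, item_t * it, void * owner )
{
    it->owner = owner;
    it->next = &l->end;
    it->prev = l->end.prev;
    l->end.prev->next = it;
    l->end.prev = it;
    l->count++;
}

/* Walk from the head to the sentinel and check that every item has an
 * owner, that forward and backward links agree, and that the number of
 * items seen matches the stored count. The bound on 'seen' also stops
 * the walk if the links form a cycle that never reaches the sentinel. */
bool list_is_consistent( const list_t * l )
{
    const item_t * it = l->end.next;
    size_t seen = 0;

    while( it != &l->end )
    {
        if( ( it == NULL ) || ( it->owner == NULL ) || ( seen >= l->count ) )
        {
            return false;
        }

        if( ( it->next == NULL ) || ( it->next->prev != it ) )
        {
            return false;
        }

        seen++;
        it = it->next;
    }

    return seen == l->count;
}
```

A non-zero count with the sentinel as the head entry fails this check, which is the same shape of inconsistency the assertion above would expose.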

RAc · August 6, 2024, 12:27pm

What you describe is frequently the result of one of these problems:

Misimplementation of the critical section. Are you using a custom port or one from the FreeRTOS GitHub repo?
Misconfiguration of interrupt priorities. The SVC handler ISR must have the lowest interrupt priority.
Violation of the ISR rules, for example an ISR with a priority above configMAX_SYSCALL_INTERRUPT_PRIORITY making system calls, or an ISR using a non-xxxFromISR system call.

jkn · August 6, 2024, 2:37pm

Thank you for these comments:

we are using one of the ‘third party’ ports in the github repo
Can you say a bit more about this? I am not understanding what you are saying here
I do not think we are violating any of these rules, but I will double-check.

richard-damon · August 6, 2024, 2:48pm

The code for the SVC Handler assumes that it can not interrupt any running interrupt, so it must be the/a lowest priority interrupt in the system.

RAc · August 6, 2024, 2:56pm

Thanks, Richard, almost correct, except it must read THE, not the/a.

Here is more info for the TO:

About the “configKERNEL INTERRUPT PRIORITY” value - Kernel - FreeRTOS Community Forums

jkn · August 6, 2024, 3:02pm

Ah - SVC Handler is an ARM-specific term, I think. I am not using an ARM.

However I take the point about the priority of the SWI or TRAP. I will read a bit more on the link you provide.

Thanks & Regards, Jon N

jkn · August 6, 2024, 5:24pm

Hmm, here is something that seems interesting …

I use taskENTER_CRITICAL_REGION_FROM_ISR() and taskEXIT_CRITICAL_REGION_FROM_ISR(). These are defined (task.h) in terms of portSET_INTERRUPT_MASK_FROM_ISR and portCLEAR_INTERRUPT_MASK_FROM_ISR.

My port has got definitions for portENTER_CRITICAL and portEXIT_CRITICAL … but not for portSET_INTERRUPT_MASK_FROM_ISR and portCLEAR_INTERRUPT_MASK_FROM_ISR.

That has to be wrong, surely?

It’s not causing a compilation error due to this in FreeRTOS.h:

#ifndef portSET_INTERRUPT_MASK_FROM_ISR
    #define portSET_INTERRUPT_MASK_FROM_ISR()    0
#endif

richard-damon · August 6, 2024, 6:37pm

If the port does not support nested interrupts, it will leave portSET_INTERRUPT_MASK_FROM_ISR undefined and rely on that default definition.

If you are trying to use nested interrupts, you need to figure out how to define those macros.
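As a rough illustration of what defining those macros involves, the sketch below models the usual save, raise, restore pattern against a simulated mask level. Everything here is an assumption for illustration: `sim_irq_mask_level` stands in for whatever register the CPU uses to hold its interrupt priority threshold, `SIM_KERNEL_MASK_LEVEL` stands in for the level that blocks every interrupt allowed to call the FromISR API, and the two helper function names are hypothetical. A real port would do this with the CPU's own instructions and must make the read and update safe against interruption.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated interrupt mask level: 0 means nothing is masked. */
static volatile uint32_t sim_irq_mask_level = 0u;

/* Hypothetical level that masks all interrupts which use the kernel API. */
#define SIM_KERNEL_MASK_LEVEL    4u

/* Raise the mask to the kernel level and return the previous level so
 * the caller can restore it later. Never lowers an already higher mask. */
uint32_t port_set_interrupt_mask_from_isr( void )
{
    uint32_t previous = sim_irq_mask_level;

    if( previous < SIM_KERNEL_MASK_LEVEL )
    {
        sim_irq_mask_level = SIM_KERNEL_MASK_LEVEL;
    }

    return previous;
}

/* Restore the level that was in force before the matching 'set'. */
void port_clear_interrupt_mask_from_isr( uint32_t previous )
{
    sim_irq_mask_level = previous;
}

#define portSET_INTERRUPT_MASK_FROM_ISR()         port_set_interrupt_mask_from_isr()
#define portCLEAR_INTERRUPT_MASK_FROM_ISR( x )    port_clear_interrupt_mask_from_isr( x )
```

Returning the previous level, rather than always unmasking on exit, is what lets these sections nest correctly when one interrupt preempts another.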

richard-damon · August 6, 2024, 8:17pm

No, it’s the/a, as there is no problem if it shares the lowest priority with other interrupts. In fact, for the ARM-M it typically does, as it shares that lowest priority with the tick interrupt, both having the priority of configKERNEL_INTERRUPT_PRIORITY.

aggarg · August 7, 2024, 5:44am

Yes, this looks wrong. What are you using taskENTER_CRITICAL_REGION_FROM_ISR() and taskEXIT_CRITICAL_REGION_FROM_ISR() for? Is it possible to remove the functionality to confirm if that is the issue?

RAc · August 7, 2024, 7:55am

Apologies, of course you are correct. At least on an ARM, equal priorities imply mutual non-preemption, thanks for the clarification.

jkn · August 7, 2024, 12:09pm

Hi Guarav, thanks for the comment. However, thinking a bit further, although this may be an issue in our port I don’t think it can be the cause of our problem.

The (single) use of taskENTER/EXIT_CRITICAL region in our firmware is in an ISR, to protect a small structure (a couple of entries). This structure gets updated with ‘return values’ by a task, and then read by the ISR to use when communicating externally (it’s an I2C ISR). So:

/* my_task.c */

taskENTER_CRITICAL();
// update fields in my critical region structure
taskEXIT_CRITICAL();


/* my_isr.c */
taskENTER_CRITICAL_REGION_FROM_ISR();
// read contents of critical region structure
taskEXIT_CRITICAL_REGION_FROM_ISR();

// use read contents

If the ENTER/EXIT...FROM_ISR() macros are null, then the only thing that will occur is that the values read from the critical region might be wrong. That is a ‘benign’ error for our case right now. I don’t see it causing the issue with the ‘null’ pxTCB??

Thanks & Regards
Jon

aggarg · August 7, 2024, 12:26pm

The macro taskENTER_CRITICAL_FROM_ISR is called from within the kernel as well and as @richard-damon mentioned, if your port supports nested interrupts, then you need portSET_INTERRUPT_MASK_FROM_ISR and portCLEAR_INTERRUPT_MASK_FROM_ISR.

P.S. - I guess taskENTER_CRITICAL_REGION_FROM_ISR is just a typo and it is actually taskENTER_CRITICAL_FROM_ISR, or are you using a modified copy of the kernel?
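For the ISR side of the shared structure, the kernel's macro names are taskENTER_CRITICAL_FROM_ISR() and taskEXIT_CRITICAL_FROM_ISR( x ), where the value returned on entry is handed back on exit. The sketch below shows that calling pattern only. The two macros are replaced by host stand-ins so that it compiles without the kernel headers, and `reply_t`, `shared_reply` and `i2c_isr_snapshot` are hypothetical names, not the poster's code.

```c
#include <assert.h>
#include <stdint.h>

/* Host stand-ins; in firmware these come from task.h and the port layer. */
typedef unsigned long UBaseType_t;
static volatile UBaseType_t sim_mask = 0u;

static UBaseType_t sim_enter( void )
{
    UBaseType_t previous = sim_mask;

    sim_mask = 1u;
    return previous;
}

static void sim_exit( UBaseType_t previous )
{
    sim_mask = previous;
}

#define taskENTER_CRITICAL_FROM_ISR()      sim_enter()
#define taskEXIT_CRITICAL_FROM_ISR( x )    sim_exit( x )

/* Hypothetical structure written by a task and read by the I2C ISR. */
typedef struct
{
    uint8_t status;
    uint8_t value;
} reply_t;

static volatile reply_t shared_reply;

/* Hypothetical ISR helper: take a snapshot while masked, then use the
 * copy after the mask has been restored. */
reply_t i2c_isr_snapshot( void )
{
    reply_t copy;
    UBaseType_t saved = taskENTER_CRITICAL_FROM_ISR();

    copy.status = shared_reply.status;
    copy.value = shared_reply.value;

    taskEXIT_CRITICAL_FROM_ISR( saved );

    return copy;
}
```

Keeping the masked region down to the copy itself limits how long other interrupts are held off.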