How to catch code that caused the hard fault

system · July 10, 2017, 2:20am

alsaleem wrote on Monday, July 10, 2017:

Hi,
I have an application that runs for several hours and then stops in the default handler.
My code runs the same loop repeatedly, so no new code is introduced at the time of the fault.
I am using an STM32F411 with FreeRTOS V8.2.1.

I am using the code presented on the FreeRTOS site, repeated below:

   .section	.text.Default_Handler,"ax",%progbits
Default_Handler:
  	/* Load the address of the interrupt control register into r3. */
  	 /* NVIC_INT_CTRL_CONST */
  	/*ldr r3, #0xE000ED04 */
  	ldr r3, =SCBICSR
  	/* Load the value of the interrupt control register into r2 from the address held in r3. */
  	ldr r2, [r3, #0]
  	/* The interrupt number is in the least significant byte - clear all other bits. */
  	uxtb r2, r2
Infinite_Loop:
	b	Infinite_Loop (<====)
	.size	Default_Handler, .-Default_Handler

R2 indeed has the value of (3), PC is pointing to (b Infinite_Loop)
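
For readers following along, the value left in R2 is the active exception number from the interrupt control and state register. Below is a minimal sketch of how that number maps to a handler name, assuming the standard ARMv7-M exception numbering; the function names and the mask macro are hypothetical and not part of FreeRTOS or CMSIS. Note that the architectural field is 9 bits wide, while the assembly above keeps only the low byte, which is enough for the values seen in this thread.

```c
#include <stdint.h>

/* VECTACTIVE occupies bits [8:0] of the ICSR on ARMv7-M. */
#define ICSR_VECTACTIVE_MASK 0x1FFu

/* Map the active exception number to a readable name (illustrative helper). */
const char *active_exception_name(uint32_t icsr)
{
    switch (icsr & ICSR_VECTACTIVE_MASK) {
        case 0u:  return "Thread mode";
        case 1u:  return "Reset";
        case 2u:  return "NMI";
        case 3u:  return "HardFault";
        case 4u:  return "MemManage";
        case 5u:  return "BusFault";
        case 6u:  return "UsageFault";
        case 11u: return "SVCall";
        case 12u: return "DebugMonitor";
        case 14u: return "PendSV";
        case 15u: return "SysTick";
        default:  break;
    }
    return ((icsr & ICSR_VECTACTIVE_MASK) >= 16u) ? "External interrupt" : "Reserved";
}

/* External interrupts start at exception number 16; returns -1 otherwise. */
int external_irq_number(uint32_t icsr)
{
    uint32_t n = icsr & ICSR_VECTACTIVE_MASK;
    return (n >= 16u) ? (int)(n - 16u) : -1;
}
```

With R2 = 3 the first function reports HardFault, which matches the observation above.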

I also implemented the other hard fault handler, which calls a C function (hard_fault_handler_c) to print the registers, but nothing was printed.

    .section	.text.HardFault_Handler
	.weak	HardFault_Handler
	.type	HardFault_Handler, %function
HardFault_Handler:
	TST LR, #4
	ITE EQ
	MRSEQ R0, MSP
	MRSNE R0, PSP
	B hard_fault_handler_c
	.size	HardFault_Handler, .-HardFault_Handler
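
The `TST LR, #4` test above picks the stack that holds the exception frame. The same decision can be sketched in C, assuming the standard ARMv7-M EXC_RETURN encoding in which bit 2 selects the process stack; the function names are hypothetical and the stack pointer values are passed in as plain numbers so that the logic can be tried on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 2 of EXC_RETURN: 0 = frame was pushed on MSP, 1 = frame was pushed on PSP. */
#define EXC_RETURN_SPSEL_BIT (1u << 2)

bool frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_SPSEL_BIT) != 0u;
}

/* Return the address of the stacked register frame (illustrative helper). */
uint32_t select_frame_pointer(uint32_t exc_return, uint32_t msp, uint32_t psp)
{
    return frame_is_on_psp(exc_return) ? psp : msp;
}
```

For example, EXC_RETURN = 0xFFFFFFFD (return to thread mode using PSP) selects the task stack, while 0xFFFFFFF1 (return to handler mode) selects the main stack.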

The vector table is arranged as :

g_pfnVectors:
	.word	_estack
	.word	Reset_Handler
	.word	NMI_Handler
	.word	HardFault_Handler
	.word	MemManage_Handler
	.word	BusFault_Handler
    ....
    ....

The hard fault handler is defined as

  	.weak	HardFault_Handler
	.thumb_set HardFault_Handler,Default_Handler

I’ve also implemented the stack overflow and malloc failed hook functions, and nothing was printed there either.

void vApplicationMallocFailedHook( void ) {
    printf("malloc failed -----------------------------------------------\n");
}
void vApplicationStackOverflowHook( TaskHandle_t xTask, signed char *pcTaskName )
{
    printf("stack overflow in task id %lu, name: %s -------------------------------------------\n", (uint32_t)xTask, pcTaskName);
}

So, could someone point out how to catch the code that caused the interrupt, and why the hard fault handler is not being served?

Thanks.

rtel · July 10, 2017, 3:39am

rtel wrote on Monday, July 10, 2017:

So, if I understand correctly, you have determined that it is the hard
fault that caused the exception, but the hard fault handler is not
executing.

Are you sure the hard fault handler is installed in the vector table?

Did you take note of the Handling Imprecise Faults section on the page
you linked to?

htibosch · July 10, 2017, 6:09am

heinbali01 wrote on Monday, July 10, 2017:

You can test if your HardFault_Handler does get called by putting a break-point in it and execute the following code:

	uint32_t ulAddress = 0xF0937531;
	printf( "Value at bad address = %u\n", *( ( unsigned * )ulAddress ) );

0xF0937531 is just an unaligned non-implemented memory address.
The printf() should make sure that the dereferencing does take place.

But if your Default_Handler is being called, isn’t there some interrupt that you haven’t set up correctly yet?

Maybe you have misspelled the name of some interrupt handler, or used the wrong case, e.g. ETH_IRQhandler instead of ETH_IRQHandler?

    .weak   HardFault_Handler
    .thumb_set HardFault_Handler,Default_Handler

The above means that Default_Handler will be called instead of HardFault_Handler.
The weak attribute means that the user can override this definition without changing anything in the library.

If you want to override the above weak definition, I would not use weak here:

    .section    .text.HardFault_Handler
    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:

Why don’t you use the C example from the FreeRTOS page that you refer to ?

Slightly modified, with less code:


struct xREGISTER_STACK {
	uint32_t spare0[ 8 ];
	uint32_t r0;
	uint32_t r1;
	uint32_t r2;
	uint32_t r3;
	uint32_t r12;
	uint32_t lr; /* Link register. */
	uint32_t pc; /* Program counter. */
	uint32_t psr;/* Program status register. */
	uint32_t spare1[ 8 ];
};

volatile struct xREGISTER_STACK *pxRegisterStack = NULL;

void prvGetRegistersFromStack( uint32_t *pulFaultStackAddress )
{
	/* 'pxRegisterStack' can be inspected in a break-point. */
	pxRegisterStack = ( struct xREGISTER_STACK *)
		( pulFaultStackAddress - ARRAY_SIZE( pxRegisterStack->spare0 ) );

	/* When the following line is hit, the variables contain the register values. */
	for( ;; );
}
/*-----------------------------------------------------------*/

/* A non-static declaration, not using naked: */
void HardFault_Handler(void)
{
	__asm volatile
	(
		" tst lr, #4                                                \n"
		" ite eq                                                    \n"
		" mrseq r0, msp                                             \n"
		" mrsne r0, psp                                             \n"
		" ldr r1, [r0, #24]                                         \n"
		" bl prvGetRegistersFromStack                               \n"
	);
}
/*-----------------------------------------------------------*/

Your ISR declarations should not be weak.

htibosch · July 10, 2017, 6:16am

heinbali01 wrote on Monday, July 10, 2017:

My sample code is using a macro that is often used in /labs, defined as:

    #define	ARRAY_SIZE( x )	( int )( sizeof( x ) / sizeof( x )[ 0 ] )

pulFaultStackAddress will point to the location where register r0 is stored.
The 8 words (32 bytes) in each of spare0 and spare1 sometimes give a bit more information about the process: data that was stored before the crash (spare1) and after it (spare0).

system · July 10, 2017, 11:10pm

alsaleem wrote on Monday, July 10, 2017:

Hi Hein,
Sorry for the late reply. The problem occurs after 5 hours, so I have to wait until the fault is caught to check.

As you can see above, it is not a misspelling, at least as far as I can tell! And it is indeed a hard fault, since R2 = 3.
Nevertheless, I went on to analyze the fault manually with the default handler.
@Real Time Engineers ltd, the fault is precise BFSR @ 0xE000ED29 = 0

I checked the content of memory pointed by MSP (my case 0x2001ff18) and found:

Address 0 - 3 4 - 7 8 - B C - F
000000002001FF10 04600240 6CE60020 686014A0 A4130020
000000002001FF20 686014A0 686014A0 00000000 F1D10008
000000002001FF30 BCC30008 0F000001 A4130020 686014A0
000000002001FF40 C9DD0108 B40A0020 58FF0120 50FF0120
000000002001FF50 67450108 BDAA2000 A0130020 64170020
000000002001FF60 BDAA2000 00000000 70FF0120 CBDF0008
000000002001FF70 AFD20008 50000000 00000000 02000000
000000002001FF80 88FF0120 4FC30008 90FF0120 71860108

R0 = A0146068
R1 = 200013A4
R2 = A0146068
R3 = A0146068
R12 = 0
LR = 0800D1F1
PC = 0800C3BC
PSR = 0100000F
BFAR = 200013A4
CFSR = A0146068
HFSR = 0801DDC9
DFSR = 20000AB4
AFSR = 20000AB4
SCB_SHCSR = 2001FF58
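
The eight stacked registers can be read straight out of the memory dump once the little-endian byte order is taken into account. The sketch below copies the first three rows of the dump (starting at 0x2001FF10) and decodes the frame at MSP = 0x2001FF18; the function and enum names are hypothetical, and it assumes the basic frame layout R0, R1, R2, R3, R12, LR, PC, PSR.

```c
#include <stddef.h>
#include <stdint.h>

/* Word order of the basic Cortex-M exception frame. */
enum { FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
       FRAME_R12, FRAME_LR, FRAME_PC, FRAME_PSR, FRAME_WORDS };

/* Bytes exactly as displayed in the dump, rows 0x2001FF10 to 0x2001FF3F. */
static const uint8_t example_dump[48] = {
    0x04, 0x60, 0x02, 0x40,  0x6C, 0xE6, 0x00, 0x20,
    0x68, 0x60, 0x14, 0xA0,  0xA4, 0x13, 0x00, 0x20,
    0x68, 0x60, 0x14, 0xA0,  0x68, 0x60, 0x14, 0xA0,
    0x00, 0x00, 0x00, 0x00,  0xF1, 0xD1, 0x00, 0x08,
    0xBC, 0xC3, 0x00, 0x08,  0x0F, 0x00, 0x00, 0x01,
    0xA4, 0x13, 0x00, 0x20,  0x68, 0x60, 0x14, 0xA0,
};

/* Assemble one 32-bit word from four little-endian bytes. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the frame located at address sp inside a dump starting at dump_base. */
void decode_exception_frame(const uint8_t *dump, uint32_t dump_base,
                            uint32_t sp, uint32_t frame[FRAME_WORDS])
{
    const uint8_t *p = dump + (sp - dump_base);
    for (size_t i = 0; i < FRAME_WORDS; i++) {
        frame[i] = read_le32(p + 4u * i);
    }
}

/* Bits [8:0] of the stacked PSR hold the exception that was interrupted. */
uint32_t stacked_exception_number(uint32_t psr)
{
    return psr & 0x1FFu;
}
```

Decoding this dump reproduces R0 to PSR as listed above. The low bits of the stacked PSR (0x0F = 15) are the SysTick exception number on ARMv7-M, which would suggest that the fault hit while the tick interrupt was running rather than in task code.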

From my .map file, here is what I found (~ PC = 0800C3BC) :

 .text.osSystickHandler
                0x0800c33c       0x18 Middlewares/Third_Party/FreeRTOS/Source/CMSIS_RTOS/cmsis_os.o
                0x0800c33c                osSystickHandler
 .text.vListInitialise
                0x0800c354       0x40 Middlewares/Third_Party/FreeRTOS/Source/list.o
                0x0800c354                vListInitialise
 .text.vListInitialiseItem
                0x0800c394       0x1c Middlewares/Third_Party/FreeRTOS/Source/list.o
                0x0800c394                vListInitialiseItem
 .text.vListInsertEnd
                0x0800c3b0       0x48 Middlewares/Third_Party/FreeRTOS/Source/list.o
                0x0800c3b0                vListInsertEnd
 .text.vListInsert
                0x0800c3f8       0x74 Middlewares/Third_Party/FreeRTOS/Source/list.o
                0x0800c3f8                vListInsert
 .text.uxListRemove
                0x0800c46c       0x54 Middlewares/Third_Party/FreeRTOS/Source/list.o
                0x0800c46c                uxListRemove

And for LR

 .text.xTaskGetTickCount
                0x0800d0d4       0x20 Middlewares/Third_Party/FreeRTOS/Source/tasks.o
                0x0800d0d4                xTaskGetTickCount
 .text.xTaskIncrementTick
                0x0800d0f4      0x17c Middlewares/Third_Party/FreeRTOS/Source/tasks.o
                0x0800d0f4                xTaskIncrementTick
 .text.vTaskSwitchContext
                0x0800d270       0xd4 Middlewares/Third_Party/FreeRTOS/Source/tasks.o
                0x0800d270                vTaskSwitchContext

Now it looks as if the hard fault was caught in FreeRTOS code (vListInsertEnd). Am I right?
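
The lookup done by hand above can be sketched as a small range search. The table below contains only the entries quoted from the .map file in this post; the struct and function names are hypothetical. Bit 0 of a stacked LR is the Thumb state bit, so it is cleared before comparing.

```c
#include <stddef.h>
#include <stdint.h>

struct map_symbol {
    uint32_t start;    /* first address of the function */
    uint32_t size;     /* size in bytes, as listed in the .map file */
    const char *name;
};

/* Entries copied from the .map excerpts quoted in this post. */
static const struct map_symbol map_symbols[] = {
    { 0x0800c33cu, 0x18u,  "osSystickHandler" },
    { 0x0800c354u, 0x40u,  "vListInitialise" },
    { 0x0800c394u, 0x1cu,  "vListInitialiseItem" },
    { 0x0800c3b0u, 0x48u,  "vListInsertEnd" },
    { 0x0800c3f8u, 0x74u,  "vListInsert" },
    { 0x0800c46cu, 0x54u,  "uxListRemove" },
    { 0x0800d0d4u, 0x20u,  "xTaskGetTickCount" },
    { 0x0800d0f4u, 0x17cu, "xTaskIncrementTick" },
    { 0x0800d270u, 0xd4u,  "vTaskSwitchContext" },
};

/* Return the name of the function containing address, or NULL if none matches. */
const char *symbol_for_address(uint32_t address)
{
    address &= ~(uint32_t)1u; /* drop the Thumb bit */
    for (size_t i = 0; i < sizeof map_symbols / sizeof map_symbols[0]; i++) {
        /* Unsigned subtraction wraps for addresses below start, so one test suffices. */
        if (address - map_symbols[i].start < map_symbols[i].size) {
            return map_symbols[i].name;
        }
    }
    return NULL;
}
```

With these entries, PC = 0x0800C3BC resolves to vListInsertEnd and LR = 0x0800D1F1 resolves to xTaskIncrementTick.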

FYI,
I have RTC_WKUP with priority 10U every one second. It reads RTC registers and computes EPOCH using mktime().
EXTI4/EXTI0 with priority 10.
UART1 no ISR (debug)
UART2 ISR priority 10.
I2C/SPI no ISR
I2S DMA ISR priority 10

5 tasks with same priority (=1)

Do I have to upgrade?

Thanks.

system · July 10, 2017, 11:25pm

alsaleem wrote on Monday, July 10, 2017:

Also,
HFSR = 0801DDC9 (<== printf code)

rtel · July 11, 2017, 12:00am

rtel wrote on Tuesday, July 11, 2017:

There is some good debug information here.

As for your question - do I have to upgrade? No, you should not have
to. Newer versions have more assert() statements to help catch
problems, but you should not have issues like this in any version.

What you are describing doesn’t make sense so far - which just means
there is some information missing.

You appear to be entering an ISR that is not defined. If the interrupt
entry is genuine, then it is nothing to do with FreeRTOS as such, as
interrupts are generated by hardware.

However, reading the registers indicates you are in a hard fault, but
the hard fault handler is not being called. Potentially the actual
interrupt handler is itself faulting, but I would still expect the fault
handler to be entered even if the fault occurred inside another interrupt.

There are several fault handlers at the base of the interrupt vector
table. Do they all have their own fault handlers, or do some of them
just go to the default handler. If some go to the default handler then
try adding a unique handler for each to see if you end up in one of those.

When you say:

LR = 0800D1F1

where is that value coming from? When an interrupt is taken the PC
address is pushed onto the task stack before the ISR is entered. Did
you pull the value from the task stack (I can’t see it in the memory
dump)? Inside a non-nested ISR itself the LR should contain an
EXC_RETURN code, not an address.

You could try unwinding the task stack from inside the default handler,
like you are inside the hardfault handler, to find the address of the
instruction that was executing when the interrupt was taken. That would
only be helpful if it was a fault that caused the interrupt entry though;
if it is a genuine interrupt then it will be asynchronous to the code
execution.

system · July 11, 2017, 12:52am

alsaleem wrote on Tuesday, July 11, 2017:

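LR is word 5 (counting from zero) of the frame at MSP; it is shown in the dump, but in little-endian byte order. PC is word 6.
Here is the code that prints the stacked values; I borrowed the names from it. There is also a correction to the BFSR and other fault register values: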

	stacked_r0 = ((unsigned long) hardfault_args[0]);
	stacked_r1 = ((unsigned long) hardfault_args[1]);
	stacked_r2 = ((unsigned long) hardfault_args[2]);
	stacked_r3 = ((unsigned long) hardfault_args[3]);

	stacked_r12 = ((unsigned long) hardfault_args[4]);
	stacked_lr = ((unsigned long) hardfault_args[5]);
	stacked_pc = ((unsigned long) hardfault_args[6]);
	stacked_psr = ((unsigned long) hardfault_args[7]);

	printf ("\n\n[Hard fault handler - all numbers in hex]\n");
	printf ("R0 = %x\n", stacked_r0);
	printf ("R1 = %x\n", stacked_r1);
	printf ("R2 = %x\n", stacked_r2);
	printf ("R3 = %x\n", stacked_r3);
	printf ("R12 = %x\n", stacked_r12);
	printf ("LR [R14] = %x  subroutine call return address\n", stacked_lr);
	printf ("PC [R15] = %x  program counter\n", stacked_pc);
	printf ("PSR = %x\n", stacked_psr);
	printf ("BFAR = %lx\n", (*((volatile unsigned long *)(0xE000ED38))));
	printf ("CFSR = %lx\n", (*((volatile unsigned long *)(0xE000ED28))));
	printf ("HFSR = %lx\n", (*((volatile unsigned long *)(0xE000ED2C))));
	printf ("DFSR = %lx\n", (*((volatile unsigned long *)(0xE000ED30))));
	printf ("AFSR = %lx\n", (*((volatile unsigned long *)(0xE000ED3C))));
	printf ("SCB_SHCSR = %lx\n", SCB->SHCSR);

BFAR (0xE000ED38) = A014606C
CFSR (0xE000ED28) = 00820000
HFSR (0xE000ED2C) = 00000040
DFSR (0xE000ED30) = 0B000000
AFSR (0xE000ED3C) = 00000000

Yes, I had the same idea: give each interrupt its own handler and not use the default at all, to see which one is causing the interrupt. Actually I am using only a few of them. All look like this, for example:

   .weak      EXTI0_IRQHandler
   .thumb_set EXTI0_IRQHandler,Default_Handler

I will make separate isr for each interrupt.

Regards,

htibosch · July 11, 2017, 1:45am

heinbali01 wrote on Tuesday, July 11, 2017:

I will make separate isr for each interrupt.

I have done that sometimes too. It takes a lot of careful typing, but it can reveal information about the problem.

system · July 11, 2017, 12:28pm

alsaleem wrote on Tuesday, July 11, 2017:

Now I do get into the hard fault handler, after making a separate ISR for each interrupt and removing the default handler. Unfortunately the result is the same as indicated above.

R0 = a0146068
R1 = 200013a4
R2 = a0146068
R3 = a0146068
R12 = 0
LR [R14] = 800d1f1  subroutine call return address
PC [R15] = 800c3bc  program counter
PSR = 100000f
BFAR = a014606c
CFSR = 8200
HFSR = 40000000
DFSR = a
AFSR = 0
SCB_SHCSR = 800

PC and LR point to addresses inside the FreeRTOS code; please see my earlier notes from the .map file.
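
The HFSR and CFSR values printed above can be broken down bit by bit. The following is a minimal sketch, assuming the ARMv7-M bit assignments for the hard fault and bus fault status fields; the macro, struct and function names are hypothetical and not taken from CMSIS.

```c
#include <stdbool.h>
#include <stdint.h>

/* HFSR bits. */
#define HFSR_VECTTBL      (1u << 1)   /* fault while reading the vector table */
#define HFSR_FORCED       (1u << 30)  /* another fault was escalated to HardFault */

/* Bus fault bits inside CFSR (the BFSR byte occupies bits 15:8). */
#define CFSR_IBUSERR      (1u << 8)
#define CFSR_PRECISERR    (1u << 9)
#define CFSR_IMPRECISERR  (1u << 10)
#define CFSR_UNSTKERR     (1u << 11)
#define CFSR_STKERR       (1u << 12)
#define CFSR_BFARVALID    (1u << 15)

struct fault_summary {
    bool escalated;            /* HardFault was forced by a configurable fault */
    bool precise_bus_error;    /* stacked PC points at the faulting instruction */
    bool imprecise_bus_error;  /* stacked PC is only near the faulting instruction */
    bool fault_address_valid;  /* BFAR holds the address that was accessed */
};

struct fault_summary summarise_fault(uint32_t hfsr, uint32_t cfsr)
{
    struct fault_summary s;
    s.escalated           = (hfsr & HFSR_FORCED) != 0u;
    s.precise_bus_error   = (cfsr & CFSR_PRECISERR) != 0u;
    s.imprecise_bus_error = (cfsr & CFSR_IMPRECISERR) != 0u;
    s.fault_address_valid = (cfsr & CFSR_BFARVALID) != 0u;
    return s;
}
```

Applied to HFSR = 0x40000000 and CFSR = 0x8200, this reads as a precise bus fault that was escalated to a hard fault, with a valid fault address in BFAR. BFAR (0xA014606C) is R0 + 4, which would be consistent with a structure member being accessed through a bad pointer.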

Any idea?

Thanks

rtel · July 11, 2017, 2:47pm

rtel wrote on Tuesday, July 11, 2017:

If you are 100% sure all your interrupt priorities are as per the
FreeRTOS requirements (nothing with a logical priority above
configMAX_SYSCALL_INTERRUPT_PRIORITY is calling any FreeRTOS API
functions from an ISR), and you have checked everything in the “my
application does not run, what could be wrong?” FAQ, then I suspect some
form of data corruption. That is, something is writing over one of the
RTOS data structures resulting in a hard fault when the structure is
accessed.

system · July 11, 2017, 3:24pm

alsaleem wrote on Tuesday, July 11, 2017:

per FreeRTOSConfig.h.

#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 5
#define configMAX_SYSCALL_INTERRUPT_PRIORITY  ( configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )

All of my interrupts (5 interrupts) have a priority of 10, which is logically lower than configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY.
I am not calling any of the FreeRTOS functions inside them.
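
To make the priority comparison concrete, here is a small sketch of the shift performed by the configMAX_SYSCALL_INTERRUPT_PRIORITY macro above. It assumes 4 implemented priority bits (the value of configPRIO_BITS is not shown in this post, so that number is an assumption), and the function names are hypothetical. On Cortex-M a numerically larger priority value is logically lower.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRIO_BITS                    4u  /* assumption: 4 implemented priority bits */
#define LIBRARY_MAX_SYSCALL_PRIORITY 5u  /* value quoted from FreeRTOSConfig.h above */

/* Shift a library-level priority into the top bits of the 8-bit priority field. */
uint8_t raw_priority(uint8_t library_priority)
{
    return (uint8_t)(library_priority << (8u - PRIO_BITS));
}

/* An ISR may use the FromISR API only if its priority is numerically at or above
   the threshold, which means logically at or below it. */
bool may_call_from_isr_api(uint8_t library_priority)
{
    return raw_priority(library_priority) >= raw_priority(LIBRARY_MAX_SYSCALL_PRIORITY);
}
```

Under these assumptions the threshold is 0x50 and priority 10 becomes 0xA0, so interrupts at priority 10 sit on the permitted side of the threshold, while priorities 0 to 4 do not.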

All my variables are global. I do not use malloc.
I am using FreeRTOS heap4.c

Are any of the FreeRTOS variables (data structures) dependent on the stack allocated by FreeRTOS?

Is there a way to find out, from inside the hard fault handler, which task the fault occurred in (a code snippet)? I do not mind digging into memory, but this would give me a clue about where it happens.

Thanks.

rtel · July 11, 2017, 3:36pm

rtel wrote on Tuesday, July 11, 2017:

The pxCurrentTCB variable points to the TCB of the currently executing
task. Depending on the debugger, you may have to cast it to a tskTCB
type in the debugger watch window to see its internals, which includes
its name:

(tskTCB*)pxCurrentTCB

system · July 11, 2017, 9:12pm

alsaleem wrote on Tuesday, July 11, 2017:

Out of curiosity, while waiting for the hard fault exception:
(1) It is mentioned here (quoted again):

Also, some processors could generate a fault or exception in response to a stack corruption before the RTOS kernel overflow check can occur.

Can you suggest a method to detect this situation ?

(2) The below code is my implementation of the stack overflow check:

void vApplicationStackOverflowHook( TaskHandle_t xTask, signed char *pcTaskName )
{
    printf("stack overflow in task id %lu, name: %s \n", (uint32_t)xTask, pcTaskName);
}

Now, this function is used to report a stack overflow, and at the same time it uses the stack itself!
Could you suggest an implementation that does not use the stack to print/report the error message?
Note: In my previous message reporting the hard fault, LR shows the address of the printf. I think this may lead to locating the cause.

(3) As mentioned :

Stack overflow is by far the most common source of support requests. The size of the stack available to a task is set using the usStackDepth parameter of the xTaskCreate() or xTaskCreateStatic() API function.

Suggestion: why not reserve a safe space/threshold so that a stack overflow is reported before it gets into this dilemma? The size could be a #define.

Thanks.

rtel · July 11, 2017, 10:15pm

rtel wrote on Tuesday, July 11, 2017:

Can you suggest a method to detect this situation ?

Not easily.

(2) The below code is my implementation of the stack overflow check:

void vApplicationStackOverflowHook( TaskHandle_t xTask, signed char *pcTaskName )
{
    printf("stack overflow in task id %lu, name: %s \n", (uint32_t)xTask, pcTaskName);
}

You should not try to return from a stack overflow - it is a fatal error
(unless you are using the MPU version, in which case the overflow is
trapped before it occurs). You can implement the stack overflow hook
simply as:

void vApplicationStackOverflowHook( TaskHandle_t xTask, signed char *pcTaskName )
{
    ( void ) xTask;
    ( void ) pcTaskName;

    /* To ensure nothing else executes. */
    taskDISABLE_INTERRUPTS();

    /* To make sure this function never exits. */
    for( ;; );
}

Then place a break point on the infinite loop.

(3) As mentioned :
Stack overflow is by far the most common source of support requests.
The size of the stack available to a task is set using the
usStackDepth parameter of the xTaskCreate() or xTaskCreateStatic()
API function.
Suggestion: why not reserve a safe space/threshold so that a stack
overflow is reported before it gets into this dilemma? The size could be a #define.

There already is - it's eating into that space/threshold that triggers
the overflow hook (if configCHECK_FOR_STACK_OVERFLOW is set
to 2).
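
The idea behind that check can be illustrated on a host with a plain byte buffer standing in for a task stack. This is a simplified sketch, not the kernel's code: the stack is filled with a known byte when the task is created (FreeRTOS uses 0xa5), and at each context switch the bytes at the far end are compared against that pattern. The guard size of 16 bytes and all names below are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5u  /* same fill value FreeRTOS uses for task stacks */
#define GUARD_BYTES     16u    /* assumption: size of the region that is checked */

/* Fill a fresh stack with the known pattern. */
void stack_prepare(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* True while the last GUARD_BYTES at the low end still hold the pattern.
   The stack grows downwards, so the low end is reached last. */
bool stack_guard_intact(const uint8_t *stack_low_end)
{
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (stack_low_end[i] != STACK_FILL_BYTE) {
            return false;
        }
    }
    return true;
}

/* Count the bytes at the low end that were never overwritten. */
size_t stack_high_water_mark(const uint8_t *stack_low_end, size_t size)
{
    size_t unused = 0;
    while (unused < size && stack_low_end[unused] == STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

The high water mark is the early warning asked about: it shrinks towards zero well before the guard region is touched. In FreeRTOS itself, uxTaskGetStackHighWaterMark() reports the corresponding figure for a task, in words rather than bytes.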

system · July 11, 2017, 10:20pm

alsaleem wrote on Tuesday, July 11, 2017:

I got the hard fault.
I set a breakpoint in vApplicationStackOverflowHook( TaskHandle_t xTask, signed char *pcTaskName ): execution stopped there, and then the hard fault followed.

pxCurrentTCB points to a very simple task that I created to show health on the debug port, as below.

void tskDum( void *pvParameters )
{
	TickType_t tickCnt;
	int i=0;

	printf("dum: dum start ...\n");

    for( ;; )
    {
    	tickCnt = xTaskGetTickCount();

    	printf("dum: dum run %d, %d, %u\n", i, (int)tickCnt, rtcEpoc);
    	i++;
		HAL_Delay(5000);
    }
}

tRet = xTaskCreate( tskDum, "dum", 200, NULL, 1, &hTaskDum);

HAL_Delay is STM32F4 HAL function.
rtcEpoc is updated by rtc WKUP ISR (priority = 10).

[Hard fault handler - all numbers in hex]
R0 = a0146068
R1 = 200013a4
R2 = a0146068
R3 = a0146068
R12 = 0
LR [R14] = 800d1f1  subroutine call return address
PC [R15] = 800c3bc  program counter (<==== RTOS code)
PSR = 101000f
BFAR = a014606c
CFSR = 8200
HFSR = 40000000
DFSR = b
AFSR = 0
SCB_SHCSR = 800

Is pxCurrentTCB really showing the current task? I ask because this is a simple task.

Regards,

rtel · July 11, 2017, 10:25pm

rtel wrote on Tuesday, July 11, 2017:

I’m not following. Are you saying you went into the stack overflow
hook, and then (while inside the hook) a hard fault is generated? If so
then it sounds like the hard fault is generated either by the code in
the overflow hook function or when attempting to return from the
overflow hook function (see my previous reply).

It may be a simple function, but it is using printf() - and printf()
can, depending on the implementation of the library, use masses of
stack space. That is why embedded systems often have cut down versions
of printf/sprintf - you will find such cut down versions in the FreeRTOS
download.

system · July 11, 2017, 10:35pm

alsaleem wrote on Tuesday, July 11, 2017:

Yes, I am returning from the stack overflow hook. And, true, the hard fault may be caused by the printf code in it.
I am using the stdio printf(). But does it keep the space it allocates (i.e. a memory leak, never freed)? I ask because it runs for 5 hours.

Regards,

rtel · July 11, 2017, 11:45pm

rtel wrote on Tuesday, July 11, 2017:

Yes, I am returning from stack overflow hook.

I have said twice not to do that.

And, true, the hard fault
maybe be caused by printf code in it.
I am using the stdio printf(). But does it keep the allocated space (i.e
memory leak) (not freed)? because it runs for 5 hours.

printf() will use stack, and may use the heap. If it uses the stack
then the stack space will be returned when the function exits. printf()
often also calls malloc(), perhaps unexpectedly for some developers, in
which case the memory it allocates should be freed again - assuming
there are no bugs in the printf() implementation.

However, you are missing the point - the overflow hook is only called
AFTER the stack has already overflowed. Calling printf() when you know
there is no stack space cannot be recommended. In fact, calling printf()
in a small embedded system is rarely recommended at all unless you know
how it is implemented. For example, it is very unlikely to be thread safe.

system · July 12, 2017, 12:16am

alsaleem wrote on Wednesday, July 12, 2017:

Thanks.
I have disabled the printf and am running again to verify whether printf is the problem.

I see you recommend (printf-stdarg.c) for printf.
I will use it to see whether it is a printf problem or something else; I am using sprintf in other tasks too.

I do not know whether the one I have (gcc) is thread-safe, leak-free, or bug-free.

Regards,