HardFault on ARM Cortex-M0+

mramirez · July 9, 2022, 12:21am

FreeRTOS: v10.3.1
CPU: STM32G0 Series

Problem: When I plug a USB charging cable into my device, I get a HardFault.

[HardFault]
r0  = 0x00000001
r1  = 0x00000004
r2  = 0x00000001
r3  = 0x00000000
r12 = 0x00000003
lr  = 0x0802c48f
pc  = 0x0801ffe2
psr = 0x01000000

r0-r3 and r12 look corrupted, which is causing the crash. I have checked the various stacks in the system and none of them has overflowed.

It looks like register corruption caused by an interrupt. Saving and restoring registers around interrupts is handled directly by the ARM CPU. The interrupts in play during this time would be the I2C interrupts used to talk to the USB controller.
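For reference, the eight values in the dump above match the frame that the Cortex-M0+ hardware pushes on exception entry. Here is a minimal C sketch of decoding such a frame. The struct and function names are hypothetical (they are not part of FreeRTOS or CMSIS), and the caller is assumed to pass the stack pointer that was active when the fault was taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame pushed by the Cortex-M0+ on exception entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} stacked_frame_t;

/* Hypothetical helper: copy the eight stacked words into a struct. */
stacked_frame_t decode_stacked_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    stacked_frame_t f = { sp[0], sp[1], sp[2], sp[3],
                          sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Exception number held in the stacked psr (IPSR field, bits 5:0 on ARMv6-M).
 * Zero means the interrupted code was running in thread mode. */
uint32_t frame_exception_number(const stacked_frame_t *f)
{
    return f->psr & 0x3Fu;
}

/* Non-zero when the Thumb (T) bit, bit 24 of the stacked psr, is set. */
int frame_thumb_bit_set(const stacked_frame_t *f)
{
    return (f->psr & (1u << 24)) != 0;
}
```

Fed the values from the dump, the exception-number field of the stacked psr is 0, which suggests the fault was taken from thread (task) code rather than from inside another handler.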

Changing the code just a little bit makes the problem go away. This probably moves the timing of the interrupt.

The ARM we are using does not have the Micro Trace Buffer (MTB).

Thanks in advance for any ideas on how to track this down.

0801ff9c <ProtocolIdle>:
 801ff9c:	22e1      	movs	r2, #225	; 0xe1
 801ff9e:	b530      	push	{r4, r5, lr}
 801ffa0:	0052      	lsls	r2, r2, #1
 801ffa2:	5c83      	ldrb	r3, [r0, r2]
 801ffa4:	0004      	movs	r4, r0
 801ffa6:	b083      	sub	sp, #12
 801ffa8:	2b01      	cmp	r3, #1
 801ffaa:	d027      	beq.n	801fffc <ProtocolIdle+0x60>
 801ffac:	213f      	movs	r1, #63	; 0x3f
 801ffae:	5c41      	ldrb	r1, [r0, r1]
 801ffb0:	07c9      	lsls	r1, r1, #31
 801ffb2:	d514      	bpl.n	801ffde <ProtocolIdle+0x42>
 801ffb4:	2594      	movs	r5, #148	; 0x94
 801ffb6:	00ad      	lsls	r5, r5, #2
 801ffb8:	5d63      	ldrb	r3, [r4, r5]
 801ffba:	2b00      	cmp	r3, #0
 801ffbc:	d128      	bne.n	8020010 <ProtocolIdle+0x74>
 801ffbe:	23c4      	movs	r3, #196	; 0xc4
 801ffc0:	33ff      	adds	r3, #255	; 0xff
 801ffc2:	5ce3      	ldrb	r3, [r4, r3]
 801ffc4:	2b00      	cmp	r3, #0
 801ffc6:	d01e      	beq.n	8020006 <ProtocolIdle+0x6a>
 801ffc8:	23e2      	movs	r3, #226	; 0xe2
 801ffca:	2201      	movs	r2, #1
 801ffcc:	005b      	lsls	r3, r3, #1
 801ffce:	54e2      	strb	r2, [r4, r3]
 801ffd0:	2201      	movs	r2, #1
 801ffd2:	343c      	adds	r4, #60	; 0x3c
 801ffd4:	78e3      	ldrb	r3, [r4, #3]
 801ffd6:	4393      	bics	r3, r2
 801ffd8:	70e3      	strb	r3, [r4, #3]
 801ffda:	b003      	add	sp, #12
 801ffdc:	bd30      	pop	{r4, r5, pc}
 801ffde:	21e2      	movs	r1, #226	; 0xe2
 801ffe0:	0049      	lsls	r1, r1, #1
 801ffe2:	5c41      	ldrb	r1, [r0, r1]	; <------- Crash Point
 801ffe4:	2900      	cmp	r1, #0
 801ffe6:	d1e5      	bne.n	801ffb4 <ProtocolIdle+0x18>
 801ffe8:	2b02      	cmp	r3, #2
 801ffea:	d1f6      	bne.n	801ffda <ProtocolIdle+0x3e>
 801ffec:	23c4      	movs	r3, #196	; 0xc4
 801ffee:	33ff      	adds	r3, #255	; 0xff
 801fff0:	5cc3      	ldrb	r3, [r0, r3]
 801fff2:	2b00      	cmp	r3, #0
 801fff4:	d01c      	beq.n	8020030 <ProtocolIdle+0x94>
 801fff6:	2308      	movs	r3, #8
 801fff8:	5483      	strb	r3, [r0, r2]
 801fffa:	e7ee      	b.n	801ffda <ProtocolIdle+0x3e>
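One way to see why the register dump looks inconsistent with the listing: the two instructions just before the crash point load 226 into r1 and shift it left by one. The sketch below only reproduces that arithmetic. It assumes execution reached 0x0801ffe2 by falling through from 0x0801ffde and that the stacked pc points at the faulting instruction; the function name is made up for illustration.

```c
#include <stdint.h>

/* Value r1 should hold on reaching the crash point if the two preceding
 * instructions executed as listed. */
uint32_t expected_r1_at_crash_point(void)
{
    uint32_t r1 = 226u; /* 801ffde: movs r1, #226 */
    r1 <<= 1;           /* 801ffe0: lsls r1, r1, #1 */
    return r1;          /* 452 decimal, 0x1C4 hex */
}
```

Under those assumptions r1 would be 0x1C4 at the faulting load, while the dump shows r1 = 0x00000004, so the CPU does not appear to have executed the instructions exactly as they are stored in flash.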

rtel · July 9, 2022, 12:48am

For clarity - you say this is a USB charging cable - so it is just providing power to the device?

mramirez · July 9, 2022, 1:10am

The device is on battery power and powered on. Plugging in the USB charging cable connected to a wall charger to power/charge the battery is triggering the crash.

rtel · July 9, 2022, 1:39am

Ok - wanted to check there is no USB traffic, just power. If powering the device, or at least switching from battery to external power, is enough to cause a crash it would sound more like a hardware issue. At least, from the information available so far. Does this happen on all your hardware devices, or just some and not others? How mature is the hardware?

mramirez · July 9, 2022, 2:24am

It is happening on all the HW devices we have tested (at least 20 units). The HW is relatively new, so the voltage to the MCU could be an issue. I assume that is what you mean by a possible HW issue. We will check this. Thanks!

jefftenney · July 9, 2022, 4:08am

Does the USB controller induce an interrupt on the MCU when VBUS is detected? If so, you might still be looking at a software issue.

Are R4 and R5 also corrupted? They are also being used in this function and should have reasonable/verifiable values. Very curious to have some registers corrupted but still have reasonable values in lr and pc. They are stored right next to R0-R3 in the TCB.

Does the USB (I2C) ISR induce a task switch?

mramirez · July 11, 2022, 7:23pm

That is correct, there is an interrupt to the MCU from the USB controller. The interrupt handler posts a semaphore (xSemaphoreGiveFromISR), which wakes up a task to do I2C reads. Those reads use an interrupt and a semaphore to wait for the response in the same task context. I still need to check whether r4 and r5 are also corrupted.

aggarg · July 12, 2022, 4:56am

If this interrupt triggers only when you connect the cable, is it possible for you to break in that ISR and examine the state?

mramirez · August 5, 2022, 2:00am

In my latest findings, it seems the root cause is not initializing xHigherPriorityTaskWoken in the code below, taken from the USB ISR:

    if ( FUSB_DDATA(ctx)->fusb_event_sema ) {
        BaseType_t xHigherPriorityTaskWoken;

        /* Signal Event */
        xSemaphoreGiveFromISR( FUSB_DDATA(ctx)->fusb_event_sema, &xHigherPriorityTaskWoken );

        if ( xHigherPriorityTaskWoken )
            portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
    }

The call to xSemaphoreGiveFromISR only ever sets xHigherPriorityTaskWoken to true, never to false. So when no higher-priority task was woken, the variable keeps whatever random value happened to be on the stack.
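A minimal host-side sketch of the initialization being discussed. The demo_ functions are hypothetical stand-ins so the example compiles without FreeRTOS. The first one imitates only the behavior described here (the flag is written when a task is woken and left untouched otherwise); it is not the real implementation of xSemaphoreGiveFromISR.

```c
#include <stdbool.h>

/* Hypothetical stand-in for xSemaphoreGiveFromISR(): writes to *woken only
 * when a higher-priority task was unblocked, otherwise leaves it untouched. */
void demo_give_from_isr(bool unblocked_higher_prio, int *woken)
{
    if (unblocked_higher_prio) {
        *woken = 1;
    }
}

/* Returns whether the ISR should request a context switch on exit. */
int demo_isr_should_yield(bool unblocked_higher_prio)
{
    int woken = 0; /* start from a known "false" (pdFALSE in FreeRTOS) */
    demo_give_from_isr(unblocked_higher_prio, &woken);
    return woken;  /* real code would pass this to portYIELD_FROM_ISR() */
}
```

In the ISR itself the equivalent change would be declaring the flag as `BaseType_t xHigherPriorityTaskWoken = pdFALSE;` before the call.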

I’ve yet to fully prove this, but I do see a difference in the PSP (process stack pointer) values before the above code runs (0x20016548) and after (0x20016568)
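For what it is worth, the two PSP values differ by 0x20 bytes, which is the size of one hardware-stacked exception frame on a Cortex-M0+ (eight 32-bit words). The sketch below only checks that arithmetic. The helper names are hypothetical, and the match does not by itself show what moved the stack pointer.

```c
#include <stdint.h>

#define STACKED_WORDS 8u /* r0-r3, r12, lr, pc, psr */

/* Difference between two stack pointer samples, in bytes. */
uint32_t psp_delta_bytes(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Non-zero when the difference equals exactly one exception frame. */
int delta_is_one_exception_frame(uint32_t before, uint32_t after)
{
    return psp_delta_bytes(before, after) == STACKED_WORDS * sizeof(uint32_t);
}
```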

aggarg · August 5, 2022, 5:23am

That does seem right. Does initializing it solve your problem?

mramirez · August 20, 2022, 12:49am

It turns out that was not the root cause. The root cause is a bug in the MCU. From the errata:

2.2.10 Prefetch failure when branching across Flash memory banks

Description

In rare cases, the code prefetch may fail upon branching and function calls across Flash memory banks, regardless of the DUAL_BANK and nSWAP_BANK option byte settings. The failing prefetch then provides an incorrect data to the CPU, which causes code execution corruption and may lead to HardFault interrupt.

Note: The following uses of the dual bank functionality remain safe:

EEPROM emulation or other data storage in bank 2

bank 2 used as a download or backup slot for firmware

mirroring the code in the banks and using bank swapping upon reset

branching and function calls across Flash memory banks, with the prefetch function deactivated

Workaround

None.

We will probably have to keep the prefetch turned off.
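A rough sketch of what keeping the prefetch off could look like at register level. The bit position of PRFTEN in FLASH_ACR (bit 8) is an assumption to verify against the reference manual for the exact part, and the demo_ names are hypothetical. The functions take a pointer so they can be exercised on a plain variable; on target one would pass the address of the real FLASH_ACR register or use whatever macro the vendor HAL provides for this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the prefetch-enable bit (PRFTEN) in FLASH_ACR.
 * Verify against the reference manual before relying on it. */
#define DEMO_FLASH_ACR_PRFTEN (1u << 8)

/* Clear the prefetch-enable bit, leaving all other bits untouched. */
void demo_flash_prefetch_disable(volatile uint32_t *acr)
{
    assert(acr != NULL);
    *acr &= ~DEMO_FLASH_ACR_PRFTEN;
}

/* Non-zero when the prefetch-enable bit is currently set. */
int demo_flash_prefetch_enabled(const volatile uint32_t *acr)
{
    assert(acr != NULL);
    return (*acr & DEMO_FLASH_ACR_PRFTEN) != 0;
}
```

Doing this early in startup, before any code that branches across the bank boundary runs, would match the "safe" case listed in the errata. Reading the bit back afterwards is a cheap sanity check.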

Moral of the story is to study the errata! Thanks everyone for the input!

aggarg · August 22, 2022, 1:53pm

Thank you for taking time to report back your solution.