Intermittent CM4 FAULT! Bus Fault! in floating point app

carlk3 · June 21, 2020, 2:40pm

This looks like a hard one.

Intermittently, like once a day, I’m getting this:

CM4 FAULT!!
SCB->CFSR = 0x00008200
Bus Fault!
Fault address = 0x01000000
r0 = 0x00000000
r1 = 0xffffffff
r2 = 0x08045a48
r3 = 0xfff783ef
r12 = 0x100a875f
lr = 0x100a9450
pc = 0x81010000
psr = 0x3e01ec70

The call stack is:

1 Cy_SysLib_FaultHandler(const uint32_t * faultStackAddr = <optimized out>) Generated_Source\PSoC6\pdl\drivers/peripheral/syslib/cy_syslib.c 444 0x1008517A (All) 
2 UsageFault_Handler() gcc/startup_psoc6_01_cm4.S 455 0x1008034E (All) 
3 <signal handler called>() ?????? ?????? 0xFFFFFFED (All) 
4 prvPortStartFirstTask() Generated_Source\PSoC6\pdl\rtos\FreeRTOS\10.0.1\Source\portable\GCC\CM4F/port.c 261 0x1008045C (All) 
5 xPortStartScheduler() Generated_Source\PSoC6\pdl\rtos\FreeRTOS\10.0.1\Source\portable\GCC\CM4F/port.c 367 0x1008334C (All) 
6 cy_ble_flashStorage() ?????? ?????? 0x14000E00 (All)

prvPortStartFirstTask() is:

/*-----------------------------------------------------------*/

static void prvPortStartFirstTask( void )
{
	/* Start the first task.  This also clears the bit that indicates the FPU is
	in use in case the FPU was used before the scheduler was started - which
	would otherwise result in the unnecessary leaving of space in the SVC stack
	for lazy saving of FPU registers. */
	__asm volatile(
					" ldr r0, =0xE000ED08 	\n" /* Use the NVIC offset register to locate the stack. */
					" ldr r0, [r0] 			\n"
					" ldr r0, [r0] 			\n"
					" msr msp, r0			\n" /* Set the msp back to the start of the stack. */
					" mov r0, #0			\n" /* Clear the bit that indicates the FPU is in use, see comment above. */
					" msr control, r0		\n"
					" cpsie i				\n" /* Globally enable interrupts. */
					" cpsie f				\n"
					" dsb					\n"
					" isb					\n"
					" svc 0					\n" /* System call to start first task. */
					" nop					\n"
				);
}
/*-----------------------------------------------------------*/

I can’t find Fault address = 0x01000000 anywhere in the Disassembly of my program. However, if I follow link register lr = 0x100a9450 it points here:

0x100A944C b.n	100a9432 <__ieee754_sqrt+0x9a>
0x100A944E bic.w	r6, r1, #80000000	; 0x80000000
0x100A9452 orrs	r6, r0
0x100A9454 beq.n	100a94c0 <__ieee754_sqrt+0x128>
0x100A9456 cmp	r1, #0

I don’t claim to understand what I am looking at here, but I don’t think it’s a coincidence that my app is spending a lot of time in sqrt() and the link register is pointing to <__ieee754_sqrt+0x9a> and the comments in prvPortStartFirstTask( void ) are talking about manipulating Floating Point Unit (FPU) related things.

I’m building with this:

C:\Users\carlk>"C:\Program Files (x86)\Cypress\PSoC Creator\4.3\PSoC Creator\import\gnu\arm\9.3.1\bin\arm-none-eabi-gcc.exe " -v
Using built-in specs.
COLLECT_GCC=C:\Program Files (x86)\Cypress\PSoC Creator\4.3\PSoC Creator\import\gnu\arm\9.3.1\bin\arm-none-eabi-gcc.exe 
COLLECT_LTO_WRAPPER=c:/program\ files\ (x86)/cypress/psoc\ creator/4.3/psoc\ creator/import/gnu/arm/9.3.1/bin/../lib/gcc/arm-none-eabi/9.3.1/lto-wrapper.exe
Target: arm-none-eabi
Configured with: /mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/src/gcc/configure --build=x86_64-linux-gnu --host=i686-w64-mingw32 --target=arm-none-eabi --prefix=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/install-mingw --libexecdir=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/install-mingw/lib --infodir=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/install-mingw/share/doc/gcc-arm-none-eabi/info --mandir=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/install-mingw/share/doc/gcc-arm-none-eabi/man --htmldir=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/install-mingw/share/doc/gcc-arm-none-eabi/html --pdfdir=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/install-mingw/share/doc/gcc-arm-none-eabi/pdf --enable-languages=c,c++ --enable-mingw-wildcard --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp --disable-libstdcxx-pch --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-headers=yes --with-newlib --with-python-dir=share/gcc-arm-none-eabi --with-sysroot=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/install-mingw/arm-none-eabi --with-libiconv-prefix=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/build-mingw/host-libs/usr --with-gmp=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/build-mingw/host-libs/usr --with-mpfr=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/build-mingw/host-libs/usr --with-mpc=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/build-mingw/host-libs/usr --with-isl=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/build-mingw/host-libs/usr --with-libelf=/mnt/workspace/workspace/GCC-9-pipeline/jenkins-GCC-9-pipeline-200_20200521_1590053374/build-mingw/host-libs/usr --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --with-pkgversion='GNU Arm Embedded Toolchain 9-2020-q2-update' --with-multilib-list=rmprofile,aprofile
Thread model: single
gcc version 9.3.1 20200408 (release) (GNU Arm Embedded Toolchain 9-2020-q2-update)

What is my next step in debugging this?

FreeRTOSConfig.h (9.3 KB)

rtel · June 21, 2020, 4:28pm

I don’t think prvPortStartFirstTask() being in the stack frame is relevant here. That function uses an exception to start the scheduler and is only called once - it may just be lingering in the exception stack from from then.

carlk3:

0x100A944C b.n	100a9432 <__ieee754_sqrt+0x9a>
0x100A944E bic.w	r6, r1, #80000000	; 0x80000000
0x100A9452 orrs	r6, r0
0x100A9454 beq.n	100a94c0 <__ieee754_sqrt+0x128>
0x100A9456 cmp	r1, #0

This looks like library code implementation of sqrt(). Interestingly there doesn’t seem to be an instruction on line 0x100a9450, which may be the cause of the exception.

First thing to do would be to go through the list on this page Guidelines - FreeRTOS Community Forums to make sure you have configASSERT() defined, stack overflow detection set to 2, etc. Which version of FreeRTOS are you using? The newer the version the more helpful configASSERT() will be for you.

I also note you are using GCC v9.3.1 - what do you have the optimisation level set to? Try and -O0 to see if that makes a difference.

carlk3 · June 21, 2020, 11:21pm

Makes sense.

I wonder if my fault handler is no good. It never seems to have anything useful in the
Fault address = 0x01000000
line, which is printed by:

/* If Bus Fault valid bit is set to 1, print BusFault Address */
if((cy_faultFrame.cfsr.cfsrReg & SCB_CFSR_BFARVALID_Msk) == SCB_CFSR_BFARVALID_Msk )
{
    printf("Bus Fault! \r\nFault address = 0x%08lx\r\n", (unsigned long)SCB->BFAR);
}

where

CMSIS_SCB     System Control Block (SCB)
uint32_t BFAR;                   /*!< Offset: 0x038 (R/W)  BusFault Address Register */

As you can see in the FreeRTOSConfig.h that I attached, I do have configASSERT defined, configCHECK_FOR_STACK_OVERFLOW 2, and anything else I can think of that might help me track down an intermittent bug. I’m using vTaskList() to keep an eye on free stack space, and xPortGetMinimumEverFreeHeapSize() to monitor heap space. About the only other thing I can think of is to switch over to FreeRTOS-MPU (Memory Protection Unit Support), but I’m afraid that would be a big project.

It says 10.0.1 in this Cypress distribution I’m using. Where’s the best place to look for more specific version info? I can try installing the latest from, …, GitHub - FreeRTOS/FreeRTOS: 'Classic' FreeRTOS distribution. Started as Git clone of FreeRTOS SourceForge SVN repo. Submodules the kernel.? I’m not even sure how to tell what version that is.

Here are my compile options:

-mcpu=cortex-m4
-mfloat-abi=softfp
-mfpu=fpv4-sp-d16
-mthumb
-I ...
-Wa,-alh=${OutputDir}/${CompileFile}.lst
-g
-D DEBUG
-D CY_CORE_ID=0
-D CY_PSOC_CREATOR_USED=1
-D CY8C6347BZI_BLD53
-Wall
-ffunction-sections
-ffat-lto-objects
-flto
-Og

It looks like -Og does do some optimization, so I will try -O0.

aggarg · June 22, 2020, 3:16pm

You mentioned that you followed LR and you see sqrt code, would you also check what PC is pointing to:

pc = 0x81010000

Thanks.

rtel · June 22, 2020, 4:34pm

Github has a couple of ways of getting to released versions, as highlighted on this screenshot.

carlk3 · June 22, 2020, 4:39pm

The Disassembly in my debugger only goes from 0x10080000 to 0x1001DCE0:

0x10080000 <__Vectors>:
   0: Disassembly of section .text:
0x10080000 .word	0x08047800
0x10080004 .word	0x100802cf
0x10080008 .word	0x0000000d
...
0x1001DC34 ; <UNDEFINED> instruction: 0xffff00ff
0x1001DC38 and	lr, r0, r0, lsl #1a
0x1001DC3C eormi	r0, r6, r8, lsl #2
0x1001DC40 eormi	r0, r6, r0
0x1001DC44 eormi	r0, r6, r4, lsl #2
0x1001DC48 andeq	r1, r0, lr, lsl pc
0x1001DC4C andeq	r0, r0, ip, lsl r5
0x1001DC50 bge	eac8700 <__StackTop+0x6a80f00>
0x1001DC54 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DC58 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DC5C svclt	0x00004760
0x1001DC60 andne	r2, r0, r5, lsr r6
0x1001DC64 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DC68 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DC6C svclt	0x00004760
0x1001DC70 andne	r2, r0, fp, lsl #f
0x1001DC74 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DC78 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DC7C svclt	0x00004760
0x1001DC80 strdne	r1, [r0], -r1	; <UNPREDICTABLE>
0x1001DC84 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DC88 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DC8C svclt	0x00004760
0x1001DC90 andne	r1, r0, r1, ror #e
0x1001DC94 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DC98 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DC9C svclt	0x00004760
0x1001DCA0 strdne	r0, [r0], -sp
0x1001DCA4 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DCA8 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DCAC svclt	0x00004760
0x1001DCB0 andne	r0, r0, r9, asr r4
0x1001DCB4 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DCB8 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DCBC svclt	0x00004760
0x1001DCC0 mulne	r0, r3, r7
0x1001DCC4 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DCC8 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DCCC svclt	0x00004760
0x1001DCD0 andne	r1, r0, r5, ror r9
0x1001DCD4 stmdami	r2, {r0, sl, ip, sp, pc}
0x1001DCD8 stclt	6, cr4, [r1], {132}	; 0x84
0x1001DCDC svclt	0x00004760
0x1001DCE0 strdne	r1, [r0], -r5

I don’t know where it is getting addresses like pc = 0x81010000. Maybe I have a wild pointer somewhere.

Last night, I got this:

CM4 FAULT!!
SCB->CFSR = 0x00008200
	Bus Fault - The instruction bus error 
	Bus Fault - The imprecise data bus error
Bus Fault!
Fault address = 0x38373339
r0 = 0x00000000
r1 = 0xfffffffc
r2 = 0x0803cbc0
r3 = 0x80100000
r12 = 0x100a68df
lr = 0x100a75d0
pc = 0x810f0000
psr = 0x40000000

0x100A75D0 bmi.w	100a72a8 <_vfprintf_r+0x9d0>

After doing some reading about imprecise data bus errors, I added this to my main:

/* If ACTLR.DISDEFWBUF is not set to 1, the imprecise BusFault will occur.
 * For the imprecise BusFault, the fault stack information won't be accurate.
 * Setting ACTLR.DISDEFWBUF bit to 1 so that bus faults will be precise.
 * Refer Arm documentation for detailed explanation on precise and imprecise
 * BusFault.
 * WARNING: This will decrease the performance because any store to memory
 * must complete before the processor can execute the next instruction.
 * Don't enable always, if it is not necessary.
 */
SCnSCB->ACTLR |= SCnSCB_ACTLR_DISDEFWBUF_Msk;

I’ll see if that gives me any better diagnostics.

carlk3 · June 22, 2020, 4:43pm

Ah, thanks. I am now running the latest, from
git clone https://github.com/FreeRTOS/FreeRTOS.git --recurse-submodules

carlk3 · June 23, 2020, 2:42pm

Looks like it was a stupid stack problem after all. I picked up some Bluetooth Low Energy (BLE) example code that used a task, a queue, and some software timers to control a couple of status LEDs. I had recently adjusted some task priorities, and the LED task was at a low priority. Apparently, it occasionally got delayed too long and its queue backed up and overflowed its little stack. In the past, vApplicationStackOverflowHook has saved me, but not this time. I have scrapped the whole implementation of the LED flasher and replaced it with a couple of Timer Counter PWM components.

Oh well, it was a learning experience.

rtel · June 23, 2020, 8:29pm

Thanks for taking the time to report back.

carlk3 · June 24, 2020, 4:37pm

Oh no, it is back:

CM4 FAULT!!
SCB->CFSR = 0x00008200
Bus Fault!
Fault address = 0x31303700
r0 = 0x00000000
r1 = 0xffffffff
r2 = 0x08045b08
r3 = 0xffd7b7de
r12 = 0x100a899f
lr = 0x100a9690
pc = 0x810f0000
psr = 0x3e01e136

This seems to be one of those nightmare, truly random and intermittent failures. Change one thing and the problem goes away, only to resurface later.

After my last post, I realized that my explanation didn’t really make sense, since a queue fixed size, so it filling up shouldn’t affect stack use, as the function only operated on one thing at a time.

Also previously, I misinterpreted the Configurable Fault Status Register (CFSR) register:

Here is the definition:

/** Configurable Fault Status Register - CFSR */
typedef struct
{
    /** MemManage Fault Status Sub-register - MMFSR */
    uint32_t iaccViol    : 1;  /**< MemManage Fault - The instruction access violation flag */
    uint32_t daccViol    : 1;  /**< MemManage Fault - The data access violation flag */
    uint32_t reserved1   : 1;  /**< Reserved */
    uint32_t mUnstkErr   : 1;  /**< MemManage Fault - Unstacking for a return from exception */
    uint32_t mStkErr     : 1;  /**< MemManage Fault - MemManage fault on stacking for exception entry */
    uint32_t mlspErr     : 1;  /**< MemManage Fault - MemManage fault occurred during floating-point lazy state preservation */
    uint32_t reserved2   : 1;  /**< Reserved */
    uint32_t mmarValid   : 1;  /**< MemManage Fault - The MemManage Address register valid flag */
    /** Bus Fault Status Sub-register - UFSR */
    uint32_t iBusErr     : 1;  /**< Bus Fault - The instruction bus error */
    uint32_t precisErr   : 1;  /**< Bus Fault - The precise Data bus error */
    uint32_t imprecisErr : 1;  /**< Bus Fault - The imprecise data bus error */
    uint32_t unstkErr    : 1;  /**< Bus Fault - Unstacking for an exception return has caused one or more bus faults */
    uint32_t stkErr      : 1;  /**< Bus Fault - Stacking for an exception entry has caused one or more bus faults */
    uint32_t lspErr      : 1;  /**< Bus Fault - A bus fault occurred during the floating-point lazy state */
    uint32_t reserved3   : 1;  /**< Reserved */
    uint32_t bfarValid   : 1;  /**< Bus Fault - The bus fault address register valid flag */
    /** Usage Fault Status Sub-register - UFSR */
    uint32_t undefInstr  : 1;  /**< Usage Fault - An undefined instruction */
    uint32_t invState    : 1;  /**< Usage Fault - The invalid state */
    uint32_t invPC       : 1;  /**< Usage Fault - An invalid PC */
    uint32_t noCP        : 1;  /**< Usage Fault - No coprocessor */
    uint32_t reserved4   : 4;  /**< Reserved */
    uint32_t unaligned   : 1;  /**< Usage Fault - Unaligned access */
    uint32_t divByZero   : 1;  /**< Usage Fault - Divide by zero */
    uint32_t reserved5   : 6;  /**< Reserved */
} cy_stc_fault_cfsr_t;

I have seen SCB->CFSR = 0x00008200 multiple times now, and I believe the true interpretation is
iBusErr : 1; /**< Bus Fault - The instruction bus error */, and
reserved3 : 1; /**< Reserved */.
What the heck is “Reserved” telling me?

Again, the lr = 0x100a9690 seems to point somewhere random in the code:

0x100A9680 <_Balloc>:
0x100A9680 ldr r3, [r0, #4c] ; 0x4c
0x100A9682 push {r4, r5, r6, lr}
0x100A9684 mov r5, r0
0x100A9686 mov r4, r1
0x100A9688 cbz r3, 100a969e <_Balloc+0x1e>
0x100A968A ldr.w r0, [r3, r4, lsl #2]
0x100A968E cbz r0, 100a96b2 <_Balloc+0x32>
0x100A9690 ldr r2, [r0, #0]
0x100A9692 str.w r2, [r3, r4, lsl #2]
0x100A9696 movs r3, #0
0x100A9698 strd r3, r3, [r0, #c]
0x100A969C pop {r4, r5, r6, pc}
0x100A969E movs r2, #21 ; 0x21
0x100A96A0 movs r1, #4

Again, Fault address = 0x31303700 and pc = 0x810f0000 seem to point somewhere consistent, but not in the memory map that I know. (I’ve attached the linker script and map files).
Turb_Mon.zip (99.6 KB)

richard-damon · June 24, 2020, 4:55pm

One cause for traps at strange addresses is that some function has clobbered its stack and overwritten a return address. Note the address 0x31303700 is the string “107”, if that rings a bell. One thing to do when it faults is look at the kernel variables, like the current TCB to determine what was running when things went wrong.

aggarg · June 24, 2020, 11:30pm

CFSR Assignment are:

BFSR is:

CFSR value in your case is: 0x00008200
And BFSR in binary is: ‭1000 0010

Mapping the 1 bits to the image above, the error is:
PRECISERR - Precise data bus error
BFARVALID - BusFault Address register (BFAR) valid

Fault address = 0x31303700 is most probably the content of BFAR telling the location that caused the fault. Since this is an invalid address, a fault is expected.

As Richard D mentioned, this looks like a memory corruption. If the earlier fix you did only reduces the frequency, then most probably it is masking the problem. So, I’d suggest to take that out so that the problem is frequently reproducible. After that, you can do few things:

Look if LR is always the same and if so, put a break-point in assembly to see where the call is coming from.
As Richard suggested, examine the value of pxCurrentTCB and then the memory at the location to figure our the task that was running when fault occurred. If you can find that out, see if removing that task fixes the problem as it will isolate the problem to that task only.

Thanks.

carlk3 · June 25, 2020, 5:40pm

Richard, Gaurav, thanks very much for your help!

Ah, I was reading the register bass akwards. Doh! Well, that makes more sense.

This is a great technique to know. Thanks again! I found out the task name and did a careful code review of the code for that task. No smoking guns. Then I spent a bunch of time manually adding bounds checking to all of my stack buffers. None of the checks triggered.

As an aside, wouldn’t it be great if -fsanitize (ubsan) were available in the GCC ARM Toolchain?

I’m now thinking maybe I have a hardware glitch. Maybe it depends where I set down my little keypad:

It’s enough to make me superstitious. Tomorrow I get some all new, different, hardware, so it will be interesting to see how that runs.

aggarg · June 25, 2020, 6:08pm

Did you also try to stop creating this task and and see if the problem still persists?

Thanks.

carlk3 · June 26, 2020, 4:56pm

After adding my bounds checking, I kicked off another run, and it never failed until this morning. So, no, but I would have if I had gotten another fault.

This morning’s fault was more interesting. There were many, many of these:

SnTTask: Timed out sending to xUserSDCrdQueue
SnTTask: Timed out sending to xUserSDCrdQueue
SnTTask: Timed out sending to xUserSDCrdQueue

followed by

Out of stack space!
Task: 0x8040790 SnTTask

This is SnTTask in its entirety:

void SnTTask(void *arg) {
	(void) arg;
    
	/* The xLastWakeTime variable needs to be initialized with the current tick
	 count. Note that this is the only time the variable is written to explicitly.
	 After this xLastWakeTime is automatically updated within vTaskDelayUntil(). */
	TickType_t xLastWakeTime = xTaskGetTickCount();
	while (!die_now) {
		/* This task should execute every 1000 milliseconds exactly. As per
		 the vTaskDelay() function, time is measured in ticks, and the
		 pdMS_TO_TICKS() macro is used to convert milliseconds into ticks.
		 xLastWakeTime is automatically updated within vTaskDelayUntil(), so is not
		 explicitly updated by the task. */
		vTaskDelayUntil(&xLastWakeTime, pdMS_TO_TICKS(1000));

		T1_t *pT1 = dbufs_T1_get_next_buf_to_write();
        T1_data_t *pBuffer = pT1->data;

        pT1->valid = false;
        
		// Populate the record
		time_t epoch_secs = FreeRTOS_time(NULL);
		pBuffer->header.time = epoch_secs;
		pBuffer->header.type = 1;
		
		generate(pBuffer->speed, N_SPEEDS, 5, 1700);      //SIMULATION
		generate(pBuffer->t_couple, N_T_COUPLES, 2, 100);  //SIMULATION            

        pT1->valid = true;
        dbufs_unlock_buf_from_write(&pT1->locks);
        
        /* Notify loggers */
		BaseType_t xStatus;
        SDCrdQueue_t qmsg = {LOGT1BUF, pT1};

        xStatus = xQueueSendToBack(xUserSDCrdQueue, &qmsg, pdMS_TO_TICKS(1000));
		if (pdTRUE != xStatus)
            DBG_PRINTF("%s: Timed out sending to xUserSDCrdQueue\n", __FUNCTION__);
        
        xStatus = xQueueSendToBack(xSysSDCrd0Queue, &qmsg, pdMS_TO_TICKS(1000));
		if (pdTRUE != xStatus)
            DBG_PRINTF("%s: Timed out sending to xSysSDCrd0Queue\n", __FUNCTION__);

        xStatus = xQueueSendToBack(xSysSDCrd1Queue, &qmsg, pdMS_TO_TICKS(1000));
		if (pdTRUE != xStatus)
            DBG_PRINTF("%s: Timed out sending to xSysSDCrd1Queue\n", __FUNCTION__);
            
        
        if (bleCommandQ) {
            ble_command_t qcmd = {PROCESS_T1_BUF, pT1};
            xStatus = xQueueSendToBack(bleCommandQ, &qcmd, pdMS_TO_TICKS(1000));
    		if (pdTRUE != xStatus)
                DBG_PRINTF("%s: Timed out sending to bleCommandQ\n", __FUNCTION__);
        }
        if (UIEventQ && !ui_suspend_updates) {
            Event evt = {NEW_T1_SIG, pT1};
            xStatus = xQueueSendToBack(UIEventQ, &evt, pdMS_TO_TICKS(500));
//    		if (pdTRUE != xStatus)
//                DBG_PRINTF("%s: Timed out sending to UIEventQ\n", __FUNCTION__);
        }
        
        // Swap next and last
        dbufs_T1_bufs_swap();
        
	}
    SnTTaskHandle = NULL;
	vTaskDelete(NULL);
}

I’m not seeing how that can run out of stack space. Is there anything I need to clean up when xQueueSendToBack times out? I can’t think of any recursive functions in this path. Maybe it’s just something in printf.

By the way, I have seen this “Timed out sending to xUserSDCrdQueue” before, and when I have pulled the SD card that is on the receiving end of that queue and stuck it into a PC, CHKDSK.EXE reports problems. I suspect there is some kind of loop in the FAT chains and FreeRTOS+FAT gets stuck in a loop. Another problem to diagnose. I might need to add some kind of check in +FAT to detect a loop. I don’t think this problem would tend to come up in normal operation, but when I am getting faults at random times while the “disk” is busy, it tends to corrupt the file system. (Brings back memories of PCs in the '80s!)

------------

UPDATE: I CHKDSK /F’d the SD card and restarted, and after several hours got another one of the 8200 faults:

SCB->CFSR = 0x00008200
Bus Fault!
Fault address = 0x38393133
r0 = 0x00000000
r1 = 0x00000003
r2 = 0x08045a20
r3 = 0xc1351998
r12 = 0x100a8fdf
lr = 0x100a9cd0
pc = 0x81010000
psr = 0xc1351998

lr = 0x100a9cd0 points here:

0x100A9CD0 ble.w 100a9354 <_dtoa_r+0x484>
0x100A9CD4 ldr r3, [pc, #128] ; (100a9e00 <_dtoa_r+0xf30>)
0x100A9CD6 movs r2, #0
0x100A9CD8 mov r0, r8
0x100A9CDA mov r1, r9

It never seems to be consistent.

This time it looks like it’s “8913”. This application spends lots of time doing things like this:

    for (unsigned i = 1; i < N_VIBRATIONS; ++i) {
		lResult = ff_fprintf((pUSDC->pT2_file), "%u", i);
		if (lResult < 0) {
			return FAIL("ff_fprintf");
        }
        if (N_VIBRATIONS - 1 == i) {
			lResult = ff_fprintf((pUSDC->pT2_file), "\n");
			if (lResult < 0) {
				return FAIL("ff_fprintf");
            }
        } else {
			lResult = ff_fprintf((pUSDC->pT2_file), ",");
			if (lResult < 0) {
				return FAIL("ff_fprintf");
            }
        }                
    }

where N_VIBRATIONS is 10,000, and pUSDC->pT2_file is a FreeRTOS+FAT FF_FILE * located on an SD card. So there are bound to be lots of ASCII number strings around.

|pxCurrentTCB|134477272 <ucHeap+44268>|0x08034444 (All)|struct tskTaskControlBlock { volatile StackType_t *pxTopOfStack; ListItem_t xStateListItem; ListItem_t xEventListItem; UBaseType_t uxPriority; StackType_t *pxStack; char pcTaskName[16]; UBaseType_t uxTCBNumber; UBaseType_t uxTaskNumber; UBaseType_t uxBasePriority; UBaseType_t uxMutexesHeld; void *pvThreadLocalStoragePointers[5]; uint32_t ulRunTimeCounter; volatile uint32_t ulNotifiedValue[1]; volatile uint8_t ucNotifyState[1]; } * volatile |
|---|---|---|---|
|pxTopOfStack|134476196 (All) |0x0803F5D8 (All)|volatile StackType_t *|
|*pxTopOfStack|134376224|0x0803F1A4 (All)|volatile unsigned long |
|xStateListItem|{...}|0x0803F5DC (All)|struct xLIST_ITEM { TickType_t xItemValue; struct xLIST_ITEM *pxNext; struct xLIST_ITEM *pxPrevious; void *pvOwner; struct xLIST *pxContainer; } |
|xItemValue|0|0x0803F5DC (All)|unsigned long |
|pxNext|134432012 (All) |0x0803F5E0 (All)|struct xLIST_ITEM *|
|pxPrevious|134432012 (All) |0x0803F5E4 (All)|struct xLIST_ITEM *|
|pvOwner|134477272 (All) |0x0803F5E8 (All)|void *|
|pxContainer|134432004 (All) |0x0803F5EC (All)|struct xLIST *|
|xEventListItem|{...}|0x0803F5F0 (All)|struct xLIST_ITEM { TickType_t xItemValue; struct xLIST_ITEM *pxNext; struct xLIST_ITEM *pxPrevious; void *pvOwner; struct xLIST *pxContainer; } |
|xItemValue|17|0x0803F5F0 (All)|unsigned long |
|pxNext|134474180 (All) |0x0803F5F4 (All)|struct xLIST_ITEM *|
|pxPrevious|134474180 (All) |0x0803F5F8 (All)|struct xLIST_ITEM *|
|pvOwner|134477272 (All) |0x0803F5FC (All)|void *|
|pxContainer|0 (All) |0x0803F600 (All)|struct xLIST *|
|uxPriority|3|0x0803F604 (All)|unsigned long |
|pxStack|134474240 (All) |0x0803F608 (All)|StackType_t *|
|*pxStack|2779096485|0x0803EA00 (All)|unsigned long |
|pcTaskName|[16]|0x0803F60C (All)|char [16]|
|0|85 'U'|0x0803F60C (All)|char|
|1|73 'I'|0x0803F60D (All)|char|
|2|84 'T'|0x0803F60E (All)|char|
|3|97 'a'|0x0803F60F (All)|char|
|4|115 's'|0x0803F610 (All)|char|
|5|107 'k'|0x0803F611 (All)|char|
|6|0 '\000'|0x0803F612 (All)|char|
|7|0 '\000'|0x0803F613 (All)|char|
|8|0 '\000'|0x0803F614 (All)|char|
|9|0 '\000'|0x0803F615 (All)|char|
|10|0 '\000'|0x0803F616 (All)|char|
|11|0 '\000'|0x0803F617 (All)|char|
|12|0 '\000'|0x0803F618 (All)|char|
|13|0 '\000'|0x0803F619 (All)|char|
|14|0 '\000'|0x0803F61A (All)|char|
|15|0 '\000'|0x0803F61B (All)|char|
|uxTCBNumber|9|0x0803F61C (All)|unsigned long |
|uxTaskNumber|0|0x0803F620 (All)|unsigned long |
|uxBasePriority|3|0x0803F624 (All)|unsigned long |
|uxMutexesHeld|0|0x0803F628 (All)|unsigned long |
|pvThreadLocalStoragePointers|[5]|0x0803F62C (All)|void *[5]|
|0|0 (All) |0x0803F62C (All)|void *|
|1|0 (All) |0x0803F630 (All)|void *|
|2|0 (All) |0x0803F634 (All)|void *|
|3|0 (All) |0x0803F638 (All)|void *|
|4|0 (All) |0x0803F63C (All)|void *|
|ulRunTimeCounter|66980637|0x0803F640 (All)|unsigned long |
|ulNotifiedValue|[1]|0x0803F644 (All)|volatile uint32_t [1]|
|0|0|0x0803F644 (All)|volatile unsigned long |
|ucNotifyState|[1]|0x0803F648 (All)|volatile uint8_t [1]|
|0|0 '\000'|0x0803F648 (All)|volatile unsigned char |

This is the same task as it was last time.

Now that I think of it, this UITask is updating an LCD and it, too, does a lot of snprintf()s that also could involve strings like “107” and “8913”.

carlk3 · June 27, 2020, 7:04am

I added a call to uxTaskGetStackHighWaterMark in the SnTTask loop, and was surprised to find it was down to 16. The other day it looked like this:

Task          State  Priority  Stack    # 
************************************************ 
UART Task       X       2       795     1 
IDLE            R       0       115     2 
SnTTask         B       5       178     8 
VibTask         B       4       147     7 
Tmr Svc         B       19      92      3 
CdDet SysSDCrd0 S       2       227     9 
CdDet SysSDCrd1 S       2       227     11 
SysSDCrd0Task   B       3       159     6 
SysSDCrd1Task   B       3       170     10 
UserSDCrdTask   B       3       210     4 
BLE Task        B       19      226     12 
Status LED Task B       2       79      13 
UITask          B       2       539     14 
CdDet UserSDCrd S       2       227     5

Today it was this:

Task          State  Priority  Stack    #
************************************************
UART Task       X       2       868     1
IDLE            R       0       175     2
VibTask         B       2       219     10
SnTTask         B       5       16      11
Card Detect     S       2       229     5
BLE Task        B       19      270     8
SysSDCrd0Task   B       4       229     6
SysSDCrd1Task   B       4       229     7
UserSDCrdTask   B       4       404     4
UITask          B       3       514     9
Tmr Svc         B       19      163     3

I bumped up the task size in xTaskCreate from 256 to 384, and now I have this:

Task          State  Priority  Stack    #
************************************************
UART Task       X       2       836     1
IDLE            R       0       175     2
VibTask         B       2       177     10
SnTTask         B       5       134     11
Card Detect     S       2       229     5
BLE Task        B       19      270     8
UserSDCrdTask   B       4       333     4
SysSDCrd1Task   B       4       298     7
SysSDCrd0Task   B       4       298     6
UITask          B       3       510     9
Tmr Svc         B       19      163     3

I’m not sure what caused the change, but maybe that helps explain the Out of stack space! error.

richard-damon · June 27, 2020, 10:27am

Some times stack usage is not deterministic, but is based on the value of parameters. This is especially true if some function is recursive (and some formatting routines can be recursive, *printf is well know for needing lots of, and variable amount of, stack)

carlk3 · June 27, 2020, 4:41pm

I got another 8200 Bus Fault this morning. I had added this to SnTTask:

static UBaseType_t last_shwm;
UBaseType_t shwm = uxTaskGetStackHighWaterMark(NULL);
if (shwm < last_shwm) {
    DBG_PRINTF("%s: StackHighWaterMark=%lu, was %lu\n", __FUNCTION__, shwm, last_shwm);
}
last_shwm = shwm;

and the last messages from that were:

SnTTask: StackHighWaterMark=290, was 307
SnTTask: StackHighWaterMark=134, was 290

So, I think I know that the 8200 fault is not being caused by SnTTask running out of stack.

Here is what I got this morning:

SCB->CFSR = 0x00008200
Bus Fault!
Fault address = 0x36000000
r0 = 0x00000000
r1 = 0xffffffff
r2 = 0x08045a00
r3 = 0x7ff77fff
r12 = 0x100a913f
lr = 0x100a9e30
pc = 0x81010000
psr = 0xbedbd5b9

Again, the fault address could plausibly be the end of an ASCII string of numbers. Again, the pc = 0x81010000 makes no sense.

Here is the call stack:

0 Cy_SysLib_ProcessingFault() main_cm4.c 401 0x10088600 (All) 
1 Cy_SysLib_FaultHandler(const uint32_t * faultStackAddr = <optimized out>) Generated_Source\PSoC6\pdl\drivers/peripheral/syslib/cy_syslib.c 444 0x10082B32 (All) 
2 UsageFault_Handler() gcc/startup_psoc6_01_cm4.S 455 0x1008034E (All) 
3 <signal handler called>() ?????? ?????? 0xFFFFFFED (All) 
4 prvPortStartFirstTask() ..\FreeRTOS\FreeRTOS\Source\portable\GCC\ARM_CM4F/port.c 267 0x100865E4 (All) 
5 xPortStartScheduler() ..\FreeRTOS\FreeRTOS\Source\portable\GCC\ARM_CM4F/port.c 379 0x1008AB6E (All) 
6 cy_ble_flashStorage() ?????? ?????? 0x14000E00 (All)

and registers:

r0=,0x00000000,r1=,0x90000000,r2=,0x00000000,r3=,0x08026B18,r4=,0x08027890,
r5=,0x08026B20,r6=,0x00000001,r7=,0x00000004,r8=,0x5FA23F57,r9=,0xBFEBB37D,
r10=,0x00000001,r11=,0x00000000,r12=,0xFFFFFFFF,sp=,0x080477CC,lr=,0x1009E343,
pc=,0x10088600,xpsr=,0x01010005,msp=,0x080477CC,psp=,0x0803F1D8,

at this point in the code:

printf("pc = 0x%08lx\r\n", (unsigned long)cy_faultFrame.pc);
printf("psr = 0x%08lx\r\n", (unsigned long)cy_faultFrame.psr);
fflush(stdout);

__disable_irq(); /* Disable global interrupts. */
while(1);

lr = 0x100a9e30 points here:

0x100A9E20 <_Balloc>:
0x100A9E20 ldr	r3, [r0, #4c]	; 0x4c
0x100A9E22 push	{r4, r5, r6, lr}
0x100A9E24 mov	r5, r0
0x100A9E26 mov	r4, r1
0x100A9E28 cbz	r3, 100a9e3e <_Balloc+0x1e>
0x100A9E2A ldr.w	r0, [r3, r4, lsl #2]
0x100A9E2E cbz	r0, 100a9e52 <_Balloc+0x32>
0x100A9E30 ldr	r2, [r0, #0]
0x100A9E32 str.w	r2, [r3, r4, lsl #2]
0x100A9E36 movs	r3, #0
0x100A9E38 strd	r3, r3, [r0, #c]
0x100A9E3C pop	{r4, r5, r6, pc}

Memory around there looks like:

|0x100a9e18|00|28|b8|bf|40|42|70|47|.(..@BpG|
|---|---|---|---|---|---|---|---|---|---|
|0x100a9e20|c3|6c|70|b5|05|46|0c|46|.lp..F.F|
|0x100a9e28|4b|b1|53|f8|24|00|80|b1|K.S.$...|
|0x100a9e30|02|68|43|f8|24|20|00|23|.hC.$ .#|
|0x100a9e38|c0|e9|03|33|70|bd|21|22|...3p.!"|
|0x100a9e40|04|21|01|f0|53|fa|03|46|.!..S..F|
|0x100a9e48|e8|64|00|28|ed|d1|00|20|.d.(... |
|0x100a9e50|70|bd|01|21|01|fa|04|f6|p..!....|
|0x100a9e58|72|1d|28|46|92|00|01|f0|r.(F....|

pxCurrentTCB:

|pxCurrentTCB|134477240 <ucHeap+44228>|0x0803444C (All)|struct tskTaskControlBlock { volatile StackType_t *pxTopOfStack; ListItem_t xStateListItem; ListItem_t xEventListItem; UBaseType_t uxPriority; StackType_t *pxStack; char pcTaskName[16]; UBaseType_t uxTCBNumber; UBaseType_t uxTaskNumber; UBaseType_t uxBasePriority; UBaseType_t uxMutexesHeld; void *pvThreadLocalStoragePointers[5]; uint32_t ulRunTimeCounter; volatile uint32_t ulNotifiedValue[1]; volatile uint8_t ucNotifyState[1]; } * volatile |
|---|---|---|---|
|pxTopOfStack|134476132 (All) |0x0803F5B8 (All)|volatile StackType_t *|
|xStateListItem|{...}|0x0803F5BC (All)|struct xLIST_ITEM { TickType_t xItemValue; struct xLIST_ITEM *pxNext; struct xLIST_ITEM *pxPrevious; void *pvOwner; struct xLIST *pxContainer; } |
|xEventListItem|{...}|0x0803F5D0 (All)|struct xLIST_ITEM { TickType_t xItemValue; struct xLIST_ITEM *pxNext; struct xLIST_ITEM *pxPrevious; void *pvOwner; struct xLIST *pxContainer; } |
|uxPriority|3|0x0803F5E4 (All)|unsigned long |
|pxStack|134474208 (All) |0x0803F5E8 (All)|StackType_t *|
|pcTaskName|[16]|0x0803F5EC (All)|char [16]|
|0|85 'U'|0x0803F5EC (All)|char|
|1|73 'I'|0x0803F5ED (All)|char|
|2|84 'T'|0x0803F5EE (All)|char|
|3|97 'a'|0x0803F5EF (All)|char|
|4|115 's'|0x0803F5F0 (All)|char|
|5|107 'k'|0x0803F5F1 (All)|char|
|6|0 '\000'|0x0803F5F2 (All)|char|
|7|0 '\000'|0x0803F5F3 (All)|char|
|8|0 '\000'|0x0803F5F4 (All)|char|
|9|0 '\000'|0x0803F5F5 (All)|char|
|10|0 '\000'|0x0803F5F6 (All)|char|
|11|0 '\000'|0x0803F5F7 (All)|char|
|12|0 '\000'|0x0803F5F8 (All)|char|
|13|0 '\000'|0x0803F5F9 (All)|char|
|14|0 '\000'|0x0803F5FA (All)|char|
|15|0 '\000'|0x0803F5FB (All)|char|
|uxTCBNumber|9|0x0803F5FC (All)|unsigned long |
|uxTaskNumber|0|0x0803F600 (All)|unsigned long |
|uxBasePriority|3|0x0803F604 (All)|unsigned long |
|uxMutexesHeld|0|0x0803F608 (All)|unsigned long |
|pvThreadLocalStoragePointers|[5]|0x0803F60C (All)|void *[5]|
|ulRunTimeCounter|73940570|0x0803F620 (All)|unsigned long |
|ulNotifiedValue|[1]|0x0803F624 (All)|volatile uint32_t [1]|
|ucNotifyState|[1]|0x0803F628 (All)|volatile uint8_t [1]|
|pxReadyTasksLists|[20]|0x080344D0 (All)|List_t [20]|

Same task as before; UITask.

If I look around at pxTopOfStack from the above, I see:

|0x0803f14c|00|00|00|00|03|00|00|00|........|
|---|---|---|---|---|---|---|---|---|---|
|0x0803f154|01|00|00|00|01|00|00|00|........|
|0x0803f15c|00|00|00|00|28|f3|03|08|....(...|
|0x0803f164|88|5a|04|08|fc|03|00|00|.Z......|
|0x0803f16c|00|00|00|a0|57|56|c0|3f|....WV.?|
|0x0803f174|00|00|00|a0|57|56|c0|3f|....WV.?|
|0x0803f17c|57|56|c0|3f|57|56|c0|3f|WV.?WV.?|
|0x0803f184|ed|ff|ff|ff|70|f4|03|08|....p...|
|0x0803f18c|df|df|df|da|00|00|00|a0|........|

Is there a good way to examine the stack for corruption?

Thanks again,

 Carl

carlk3 · June 27, 2020, 4:52pm

I got some new hardware, and I have been running three instances of this project. However, I only have one LCD module, so only one instance is running the UITask operations that create and update the display. I haven’t seen any Bus Faults from the other instances, lately, so I think the problem is probably in the display stuff. But where? Also, I am using emWin library for the drawing, so there is a third-party library involved, and I don’t have source for that. However, I think it is pretty rock solid.

EDIT: I just thought of another difference when the LCD module is involved. It has a rather dumb T6963 interface, and the LCD routines spend a lot of time in spin loops.

Run Time Statistics without LCD:

Task            Abs Time      % Time
****************************************
UART Task       17              <1%
IDLE            379806614               97%
VibTask         74141           <1%
SnTTask         195605          <1%
Card Detect     0               <1%
BLE Task        209792          <1%
SysSDCrd0Task   38060           <1%
SysSDCrd1Task   551             <1%
UserSDCrdTask   9940233         2%
UITask          194986          <1%
Tmr Svc         0               <1%

With LCD:

Task            Abs Time      % Time
****************************************
UART Task       16              <1%
IDLE            271197937               73%
VibTask         62220           <1%
SnTTask         148123          <1%
Tmr Svc         3               <1%
Status LED Task 1               <1%
BLE Task        198600          <1%
SysSDCrd0Task   268382          <1%
SysSDCrd1Task   200311          <1%
UserSDCrdTask   10439195                2%
UITask          87454534                23%
Card Detect     0               <1%

I’m pretty sure that all that CPU time spent in the UITask is in a couple of routines such as this:

static /* inline */ bool StatusCheck0and1 (void){
    uint8 DispStatus;
   // Normally takes 59 us
    TickType_t xStart = xTaskGetTickCount();
    do {
        DispStatus = GLCD_Read8 (Status);
        DispStatus = DispStatus & 0b00000011;
        if ( DispStatus == 0b00000011) {
            return true;
        }
    } while (xTaskGetTickCount() - xStart < pdMS_TO_TICKS(2));
    DBG_PRINTF("%s failed.\n", __FUNCTION__);
    return false;    
}

richard-damon · June 27, 2020, 7:03pm

One thing to note, the following lines in your stack trace:

Tell me that your debugger isn’t smart enough to handle stack traces in ISRS, that 3rd line is the signal to switch from the ISR stack to the program stack, but it didn’t, and just traced what was leftover on the main stack before the tasks started.

You may need to manually backtrack in the program stack to see where it was coming from. (that stack would start at the psp and go upwards)

I do note that at the point of the fault, the stack isn’t over limit, as it is reported as 0x0803F1d8 and the base of the stack (converting the decimal number to hex) is 0x0803E9E0, but if something overwrote part of the stack we don’t know.

Sometimes if you don’t want to analyze the stack in detail, you can scan it for addresses that look like code addresses (0x100xxxxx) and look at those addresses to see if you recognize any return address links on the stack (they should be pointing to just after a call/branch and link of some sort.