Extending the xPortPendSVHandler save area for ARM_CM0

I’m trying to adapt FreeRTOS-Kernel/portable/GCC/ARM_CM0/port.c for Raspberry Pi Pico. The Pico has a hardware divider whose state must be saved for context switches. The state consists of 4 32-bit words. (See hw_divider_save_state and hw_divider_restore_state in pico-sdk/divider.S at master · raspberrypi/pico-sdk · GitHub.

Here is my attempt:

void xPortPendSVHandler( void )
{

    /* This is a naked function. */

	// Save area:
	//psp->				|	0	
	//					|	-4	r11
	//					|	-8	r10
	//					|	-12	r9
	//pxTopOfStack + 32	|	-16	r8
	//					|	-20	r7
	//					|	-24	r6
	//					|	-28	r5
	//psp - 32			|	-32	r4
	//					|	-36	SIO_DIV_QUOTIENT
	//					|	-40	SIO_DIV_REMAINDER
	//					|	-44	SIO_DIV_UDIVISOR
	//pxTopOfStack->	|	-48	SIO_DIV_UDIVIDEND

    __asm volatile
    (
        "	.syntax unified						\n"
        "	mrs r0, psp							\n"
        "										\n"
        "	ldr	r3, pxCurrentTCBConst			\n"/* Get the location of the current TCB. */
        "	ldr	r2, [r3]						\n"
        "										\n"
        "	subs r0, r0, #32					\n"/* Make space for the remaining low registers. */
        "	stm r0!, {r4-r7}					\n"/* Store the low registers that are not saved automatically. */
        " 	mov r4, r8							\n"/* Store the high registers. */
        " 	mov r5, r9							\n"
        " 	mov r6, r10							\n"
        " 	mov r7, r11							\n"
        " 	stm r0!, {r4-r7}					\n"
        "										\n"
        "	subs r0, r0, #48					\n"/* Make space for divider state. */
        "	str r0, [r2]						\n"/* Save the new top of stack. */
        "										\n"
        "	ldr r2, =SIO_BASE					\n"/* hw_divider_save_state */
        "	ldr r1, [r2, #0x00000078]			\n"/* SIO_DIV_CSR_OFFSET (sio.h) */
		/* wait for results as we can't save signed-ness of operation */
        "MY1:									\n"
        "	lsrs r1, 1							\n"/* #SIO_DIV_CSR_READY_SHIFT_FOR_CARRY */
        "	bcc MY1								\n"
        "	ldr r4, [r2, #0x00000060]			\n"/* SIO_DIV_UDIVIDEND_OFFSET */
        "	ldr r5, [r2, #0x00000064]			\n"/* SIO_DIV_UDIVISOR_OFFSET */
        "	ldr r6, [r2, #0x00000074]			\n"/* SIO_DIV_REMAINDER_OFFSET */
        "	ldr r7, [r2, #0x00000070]			\n"/* SIO_DIV_QUOTIENT_OFFSET */
        "	stm r0!, {r4-r7}					\n"/* Save HW divider state */
        "										\n"
        "	push {r3, r14}						\n"
        "	cpsid i								\n"
        "	bl vTaskSwitchContext				\n"
        "	cpsie i								\n"
        "	pop {r2, r3}						\n"/* lr goes in r3. r2 now holds tcb pointer. */
        "										\n"
        "	ldr r1, [r2]						\n"
        "	ldr r0, [r1]						\n"/* The first item in pxCurrentTCB is the task top of stack. */
        "										\n"
        "	ldr r2, =SIO_BASE					\n"/* hw_divider_restore_state */
        "	ldm r0!, {r4-r7}					\n"
        "	str r4, [r2, #0x00000060]			\n"/* SIO_DIV_UDIVIDEND_OFFSET */
        "	str r5, [r2, #0x00000064]			\n"/* SIO_DIV_UDIVISOR_OFFSET */
        "	str r6, [r2, #0x00000074]			\n"/* SIO_DIV_REMAINDER_OFFSET */
        "	str r7, [r2, #0x00000070]			\n"/* SIO_DIV_QUOTIENT_OFFSET */
        "										\n"
        "	adds r0, r0, #16					\n"/* Move to the high registers. */
        "	ldm r0!, {r4-r7}					\n"/* Pop the high registers. */
        " 	mov r8, r4							\n"
        " 	mov r9, r5							\n"
        " 	mov r10, r6							\n"
        " 	mov r11, r7							\n"
        "										\n"
        "	msr psp, r0							\n"/* Remember the new top of stack for the task. */
        "										\n"
        "	subs r0, r0, #32					\n"/* Go back for the low registers that are not automatically restored. */
        " 	ldm r0!, {r4-r7}					\n"/* Pop low registers.  */
        "										\n"
        "	bx r3								\n"
        "										\n"
        "	.align 4							\n"
        "pxCurrentTCBConst: .word pxCurrentTCB	\n"
        "SIO_BASE: .word SIO_BASE"
    );
}

I also added

pxTopOfStack -= 4;                                       /* Divider state */

to pxPortInitialiseStack. Alas, it doesn’t work. Gets stuck in a loop, I think (but past the bcc MY1 loop). Is there anything else I need to change? Have I made a dumb mistake in the assembly code? (I am a beginner at ARM assembly.)

My starting point was FreeRTOS-Kernel/port.c at main · FreeRTOS/FreeRTOS-Kernel · GitHub. I’ve attached my modified version.
port.c (24.9 KB)

You managed to get hardware?

This statement doesn’t look right:

Is SIO_BASE the address of the hardware divider peripheral? If so you should be able to eliminate this line altogether. Gcc should manage putting the value into the literal pool for you. I think.

I managed to get one. Then, I quickly learned that the easiest way to get debugging capability is to use a second one as a “PicoProbe”. Been waiting for that one for months. [Why was I such a cheapass? Should’ve gotten the 3-pack.]

If I try that, I get [build] /usr/lib/gcc/arm-none-eabi/9.2.1/../../../arm-none-eabi/bin/ld: libFreeRTOS-Kernel.a(port.c.obj): in function pxCurrentTCBConst’:
[build] port.c:(.text.isr_pendsv+0x64): undefined reference to SIO_BASE' [build] collect2: error: ld returned 1 exit status.

Thanks.

In FreeRTOS style, you probably need to #define SIO_BASE in port.c. Alternatively, change the #SIO_BASE to #0xD0000000 – the value of SIO_BASE.

	/* hw_divider_save_state */
    "	ldr r2, =#0xD0000000				\n"/* SIO_BASE */

I tried that, but it still hangs. Then I tried commenting out this line:

//" bcc MY1 \n"

but it still hangs.

I am no longer sure that it is making it as far as bl vTaskSwitchContext.

What’s a good assembly trick for causing a hard fault? In C I use this:

void (*bad_instruction)(void) = (void *)0xE0000000;
bad_instruction();

" UDF #0" should cause an undefined instruction exception.

I looked at the port.c you attached above, and it doesn’t look like you modified vPortStartFirstTask, which you will need to do. It seems your first task is never even starting.

Ah, that was a problem! I changed

    "	adds r0, #32    			\n"/* Discard everything up to r0. */

to

    "	adds r0, #48    			\n"/* Discard everything up to r0. */

and now I am getting a hard fault for free! Unfortunately,

Link Register (LR): A5A5A5A5
Program Counter (PC): A5A5A5A4

is not very helpful, except that it looks like the fill pattern for stack when #define configCHECK_FOR_STACK_OVERFLOW 2.

UPDATE: Scratch that. I still had one of my intentional hardfaults in there. Now, it is running, and passing tests!

Thanks so much, Jeff!

The version that runs: port.c (24.9 KB)

Glad it worked!

Not to get too far into the weeds on this thread, but I’m curious about this solution. Why save the divider state? Does the compiler use the divider directly? Or is there a target usage model where protecting the divider with critical section or mutex would not be suitable? I ask because this divider peripheral seems very much like a CRC module or similar, which traditionally is shared by way of application code not OS code.

By default, the Pico SDK overrides the / (and similar) operators to use the hardware divider. The work-around is to add

pico_set_divider_implementation(TARGET compiler)

in the CMakeLists.txt.

If you fail to do this you get no error indications, just weird and subtle intermittent data errors, and divisions are everywhere, so finding that was hell. (See Context switching breaks SPI/DMA on the Raspberry forums.)

1 Like

A last curiosity - seems to me you need to save the signed input registers too, no?

I’m not sure. Here’s what the datasheet says:

To support save and restore on interrupt handler entry/exit (or on e.g. an RTOS context switch), the result registers are
also writable. Writing to a result register will cancel any operation in progress at the time. The DIV_CSR.DIRTY flag can
help make save/restore more efficient: this flag is set when any divider register (operand or result) is written to, and
cleared when the quotient is read.

I think that’s an optimization I could look at in the future.

The SDK module pico_divider https://github.com/raspberrypi/pico-sdk/tree/master/src/common/pico_divider/include/
pico/divider.h provides both the AEABI implementation needed to hook the C / and % operators for both 32-bit and 64-bit
integer division, as well as some additional C functions that return quotients and remainders at the same time. All of
these functions correctly save and restore the hardware divider state (when dirty) so that they can be used in either user
or IRQ handler code.

The SDK book says this:

47 // For a really fast divide, you can use the inlined versions... the / involves a function
  call as / always does
48 // when using the ARM AEABI, so if you really want the best performance use the inlined
  versions.
49 // Note that the / operator function DOES use the hardware divider by default, although
  you can change
50 // that behavior by calling pico_set_divider_implementation in the cmake build for your
Raspberry Pi Pico C/C++ SDK
  target.
51 printf("%d / %d = (by operator %d) (inlined %d)\n", dividend, divisor,
52 dividend / divisor, hw_divider_s32_quotient_inlined(dividend, divisor));
53
54 // Note however you must manually save/restore the divider state if you call the inlined
methods from within an IRQ
55 // handler.
56 hw_divider_state_t state;
57 hw_divider_divmod_s32_start(dividend, divisor);
58 hw_divider_save_state(&state);
59
60 hw_divider_divmod_s32_start(123, 7);
61 printf("inner %d / %d = %d\n", 123, 7, hw_divider_s32_quotient_wait());
62
63 hw_divider_restore_state(&state);
64 int32_t tmp = hw_divider_s32_quotient_wait();
65 printf("outer divide %d / %d = %d\n", dividend, divisor, tmp);

My thought was that if I could copy the behavior of hw_divider_save_state and hw_divider_restore_state, that should do the job. I’m hoping to get some more advice from the Raspberry people.

The divider takes only 8 cycles to perform a division, so why not just disable interrupts while it is in use? Is interrupt latency so critical in your application that you can’t afford another ~10 cycles of latency?

I use a different Arm Cortex M0+ MCU that also has a hardware divider, and the manufacturer-supplied EABI division function does exactly that (save the interrupt state, disable interrupts, do the division, and restore the state).

Besides, with saving/restoring the state during context switches instead, if you ever do an integer division in an ISR then you will corrupt the divider state unless you manually save/restore the state in the ISR.

That might be a good approach. The other one I’m wondering about is writing to a divider result register in PendSV, which will cancel any operation in progress at the time.

I guess there are at least three possibilities:

  1. Save/restore in PendSV, as above .
  2. Disable interrupts during divides.
  3. Cancel divide in progress by writing to result register in PendSV.

That was the original purpose of the hw_divider_save_state and hw_divider_restore_state routines provided by the SDK.

I was considering saving the divider context last so it would have completed any divide by the time the context switch got to the divider. Then restore/restart the divider context early when restoring the task so it will have a “fresh” divide ready for the previously preempted task. This does 1 extra divide per context switch but it should not take any additional time.