SMP Spinlocks Deadlock

As per this thread, I'm currently working on an SMP port for the Cortex-A53(s) on the Zynq UltraScale+ MPSoC platform. For quite a while now I have been more or less finished (i.e. it seems to work well).

My next step, however, was to load the full_demo (with all test tasks) onto the processor. That's when I started to notice that my spinlocks keep blocking and nothing continues to execute. This issue has nothing to do with the tests themselves, because the error occurs in the kernel. I verified independently of FreeRTOS that my spinlocks are in fact working correctly. I also verified that my context saving and restoring works as expected.

I even integrated a third (test) spinlock into FreeRTOS for further testing. I try locking and unlocking it in two separate FreeRTOS tasks. That's where I noticed that whenever I disable my interrupts, my spinlocks work fine again. I even narrowed it down: I only need to disable interrupts while a spinlock is held (continuously, from the moment it is locked until it is unlocked).

Here is what seems to be happening:
Assume two tasks that want to acquire a spinlock (either through the FreeRTOS API or the test spinlock, it doesn't matter). Task A runs on core 0 and acquires the lock. Then an interrupt happens, because interrupts were enabled again. It can therefore happen that Task A gets rescheduled on core 1. When it then tries to unlock the lock, it detects that it is not the owner of the lock and ends up crashing or not unlocking. Since the spinlock never gets unlocked, Task B (when run on core 1) cannot acquire the lock because it is still locked by core 0. Or it can lock and unlock it at will (when running on core 0), but never gets to the point of unlocking it so that core 1 can use it.

So whenever a lock is acquired, interrupts need to stay disabled for as long as the lock is held.
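
To make that rule concrete, here is a minimal sketch of the pattern I mean. All names in it (spinlock_t, spin_lock, spin_unlock, irq_mask, irq_restore) are placeholders rather than FreeRTOS or Xilinx APIs; the only point is the ordering: mask interrupts first, take the lock, and unmask only after the lock has been released on the same core.

    #include <stdint.h>

    typedef struct { volatile uint32_t owner; } spinlock_t;   /* placeholder type */

    extern void     spin_lock( spinlock_t *lock );     /* records the calling core as owner  */
    extern void     spin_unlock( spinlock_t *lock );   /* must be called by the owning core  */
    extern uint64_t irq_mask( void );                  /* mask IRQ/FIQ, return previous DAIF */
    extern void     irq_restore( uint64_t daif );      /* restore the previous DAIF value    */

    void locked_region( spinlock_t *lock )
    {
        uint64_t prev = irq_mask();   /* 1. mask interrupts on this core              */
        spin_lock( lock );            /* 2. then take the lock                        */

        /* ... critical section: the task cannot be preempted here, so it
         *     cannot be migrated to another core while holding the lock ... */

        spin_unlock( lock );          /* 3. release the lock on the same core         */
        irq_restore( prev );          /* 4. only now allow interrupts again           */
    }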

How does FreeRTOS with SMP make sure that this never happens? From my understanding this is the responsibility of the developer. If that is the case, could it explain why the spinlock issues occur?

Check out this stack trace from when the deadlock happened:

APU	
	Cortex-A53 #0 (External Debug Request), EL3(S)/A64	
		0x000000000004d754 vPortRecursiveLock(): port.c, line 263	
		0x00000000000496d0 vTaskSwitchContext(): tasks.c, line 5153	
		0x0000000000041900 _interrupt_handlers(): portASM.S, line 331	
		0x00000000000455e0 xQueueSemaphoreTake(): queue.c, line 1804	
		0x00000000000519d8 vInterruptMutexSlaveTask(): IntSemTest.c, line 388	
		0x0000000000000000	
	Cortex-A53 #1 (External Debug Request), EL3(S)/A64	
		0x000000000005441c vCompetingMathTask4(): flop.c, line 287	
		0x0000000000000000	

The values of the spinlocks at this moment look like this:
[image: spinlock values]
All zeros means unlocked. The spinlock at index 1 is locked by core 1 once.
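
(For illustration only: a recursive spinlock like this can be pictured as an owner field plus a recursion count, where an all-zero structure means unlocked. This is a hypothetical layout, not necessarily the exact structure used in my port.)

    #include <stdint.h>

    /* Hypothetical recursive spinlock layout, matching the reading above:
     * all fields zero -> unlocked; otherwise ulOwner identifies the owning
     * core and ulCount is the number of nested takes. */
    typedef struct
    {
        volatile uint32_t ulOwner;   /* owning core (or a sentinel when free) */
        volatile uint32_t ulCount;   /* recursion depth                       */
    } RecursiveSpinlock_t;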

Whenever the kernel acquires the task lock using the portGET_TASK_LOCK macro, interrupts are disabled or the scheduler is suspended. Therefore, the task which has acquired the task lock cannot be rescheduled on a different core until it releases the task lock. How is it happening in your code?
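
In other words, the pattern is roughly the following. This is a simplified sketch rather than the actual kernel code, and the exact macro signatures differ between FreeRTOS-Kernel versions (some take a core ID parameter):

    #include "FreeRTOS.h"
    #include "task.h"

    /* Simplified sketch of the ordering described above (not the real kernel
     * sources): interrupts go off before the kernel spinlocks are taken, and
     * come back on only after both locks have been released. */
    static void vExampleKernelCriticalSection( void )
    {
        portDISABLE_INTERRUPTS();   /* interrupts off on this core first...      */
        portGET_TASK_LOCK();        /* ...so the current task cannot be migrated */
        portGET_ISR_LOCK();         /*    while it holds the kernel spinlocks    */

        /* ... kernel data structures are accessed here ... */

        portRELEASE_ISR_LOCK();
        portRELEASE_TASK_LOCK();
        portENABLE_INTERRUPTS();    /* unmask only after both locks are released */
    }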


With a few minor modifications I was able to add one simple check to both the locking and unlocking functions:

    /* Assert that IRQ/FIQ are masked in DAIF (bits 6 and 7), interrupts are
     * masked at the GIC priority mask register, or the scheduler is suspended. */
    if( !( ( daif_value & ( 0b11 << 6 ) ) ||
           ( *( ( volatile uint32_t * ) 0x00F9020004 ) ==
               ( uint32_t ) ( configMAX_API_CALL_INTERRUPT_PRIORITY << portPRIORITY_SHIFT ) ) ||
           ( uxSchedulerSuspended > 0 ) ) )
    {
        while( 1 ); /* Trap here so the failure is visible in the debugger. */
    }

It checks whether interrupts are masked at the processor level (DAIF), masked at the GIC level via the priority mask register, or whether the scheduler is suspended.
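
(For reference, daif_value here is just the current value of the AArch64 DAIF register; a small helper like the following could read it. It is shown only as an illustration, not the exact code from my port.)

    #include <stdint.h>

    /* Read the AArch64 DAIF register. Bit 7 is the IRQ mask (I) and bit 6 is
     * the FIQ mask (F), which is what the (0b11 << 6) test above checks. */
    static inline uint64_t read_daif( void )
    {
        uint64_t daif;
        __asm__ volatile ( "mrs %0, daif" : "=r" ( daif ) );
        return daif;
    }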

This assert fails right at startup, during vTaskSwitchContext, when the ISR lock gets released at the end. At that point the scheduler is not suspended, interrupts are not masked at the GIC, and interrupts are not disabled on the core itself.


APU	
	Cortex-A53 #0 (Breakpoint: port.c:290), EL3(S)/A64	
		0x000000000004d820 vPortRecursiveUnlock(): port.c, line 290	
		0x00000000000497f8 vTaskSwitchContext(): tasks.c, line 5235	
		0x0000000000041900 _interrupt_handlers(): portASM.S, line 331	
		0x0000000000000000	
	Cortex-A53 #1 (External Debug Request), EL3(S)/A64	
		0x000000000004d750 vPortRecursiveLock(): port.c, line 270	
		0x00000000000496d0 vTaskSwitchContext(): tasks.c, line 5153	
		0x0000000000041900 _interrupt_handlers(): portASM.S, line 331	

At this point the locks look very strange.

What is the purpose of this check? vTaskSwitchContext is called from an interrupt and another interrupt cannot be taken until this one completes?

The idea of this check was to verify your statement that “every time the lock is taken the interrupts are disabled/masked or the scheduler is suspended”. However, as you can see in my previous post, this is not always the case.
From my understanding the A53 port (without SMP) supports interrupt nesting, so your statement from above is not correct: during the ISR (where vTaskSwitchContext is called) another interrupt may still be taken.
Anyhow, I think we drifted a bit from my original question: is it required that interrupts stay disabled for the whole period of time from when a lock is acquired until it gets released again? And if so, how is that ensured?
Since this is quite a complicated topic which is not easily analyzed from a distance, I would like to offer a quick remote debugging session.

We recently got this port contribution - Adding support for ZynqMPSoc(A53*4) by yunyafeng · Pull Request #13 · FreeRTOS/FreeRTOS-Kernel-Community-Supported-Ports · GitHub. Would you please give it a try and see if you face the same problem with this one too? If you still face the same problem, please DM me with your email ID and preferred time slots.