SMP Porting to Cortex R5

Hi,
We are attempting to port the SMP version of FreeRTOS to the Cortex-R5. The work is inspired by the RP2040 port.
I'd appreciate a porting checklist.

  1. Tick interrupt: does it happen only on the master core, or identically on both (all) cores? Or some other scheme?
  2. Startup and initialization of the second core: a simple loop that waits for an sp, a pc (xPortStartSchedulerOnCore or xPortStartScheduler) and an interrupt table address, and jumps to the given pc once they are set. Is this correct?
  3. What else do we need to know to implement multi-core safe mutexes, locks, etc.? Is there some dependency on the Processor SDK (outside the FreeRTOS source tree)?
  4. Is there any static data that the kernel uses that we need to be aware of?
  5. If I understand correctly, both cores run the same scheduler, and it picks tasks from a queue that is visible to both cores. Is that correct?

Our target is offloading computation tasks to the second core; there is no need for interrupt/peripheral access on the second core.

Any help would be appreciated.

Best regards
Rasty

Good questions and we certainly need to create a porting checklist for SMP.

For now here you go:

  1. You will need a tick interrupt on one core only. The tick handler will execute portYIELD_CORE(corenumber) when the other core(s) need to reschedule their work (a minimal handler sketch follows this list).
  2. You will need to do something specific to your architecture to launch the second core. The first core to run will call xPortStartScheduler() and your porting layer is expected to get all the cores running. On the RP2040 you can see it with an API call to multicore_launch_core1(functionPointerToLaunch).
  3. The key piece is to ensure the two LOCKS (below) are implemented with hardware.
  4. The kernel data is protected by two spinlocks: the TASK lock and the ISR lock. You will need to ensure you have implemented portGET_TASK_LOCK(), portRELEASE_TASK_LOCK(), portGET_ISR_LOCK() and portRELEASE_ISR_LOCK(). These need to be implemented in such a way that NO other core can get past the lock while it is held. On the RP2040 these are hardware spinlocks, so when the second core reaches the GET it will wait until the other core executes the RELEASE.
  5. Exactly right. You can lock tasks to specific cores with core affinity. Doing so may increase the execution cycles available to a specific task, but it will also make some context switches require two checks of the queue of waiting tasks, because the highest priority task may not be allowed to run on that core.
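For point 1, here is a minimal sketch of what the primary-core tick handler could look like, in the spirit of the RP2040's SysTick handler. vApplicationClearTickInterrupt() is a hypothetical, hardware-specific helper, not part of any existing R5 port.

/* Sketch of a tick handler that runs on the primary core only. */
void vPortTickHandler( void )
{
    UBaseType_t uxSavedInterruptStatus;

    vApplicationClearTickInterrupt(); /* hypothetical: acknowledge the timer IRQ */

    uxSavedInterruptStatus = taskENTER_CRITICAL_FROM_ISR();
    {
        /* xTaskIncrementTick() returns pdTRUE when this core must switch
         * context; the SMP kernel's tick processing issues portYIELD_CORE()
         * itself for any other core that now has higher priority work. */
        if( xTaskIncrementTick() != pdFALSE )
        {
            portYIELD_FROM_ISR( pdTRUE );
        }
    }
    taskEXIT_CRITICAL_FROM_ISR( uxSavedInterruptStatus );
}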

Thank you very much for the fast reply!
I need some clarification on question #1.
I want to make sure that I understand you correctly.
The tick interrupt runs only on the first (0) master core and triggers scheduling on the second core.
There is no tick interrupt on the slave cores.
Is that correct?

Best regards
Rasty

That is correct.

You can see this behavior in the RP2040 port by looking here: xPortStartSchedulerOnCore(). You will see that the first thing done in that function is to check the core number; if it is the primary core, it starts the tick timer interrupt.
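As an illustration only (not taken from an existing port), the equivalent check in an R5 port could look like the sketch below; vPortSetupTimerInterrupt(), vPortInstallYieldInterrupt() and vPortStartFirstTask() are assumed, hardware-specific helpers.

/* Sketch: per-core scheduler start - only core 0 drives the tick. */
static void prvStartSchedulerOnThisCore( void )
{
    if( portGET_CORE_ID() == 0 )
    {
        vPortSetupTimerInterrupt();   /* assumed helper: start the tick timer */
    }

    /* Every core needs the inter-core yield interrupt enabled. */
    vPortInstallYieldInterrupt();     /* assumed helper */

    /* Restore the context of the first task selected for this core. */
    vPortStartFirstTask();            /* assumed helper */
}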

The tick interrupt eventually calls portYIELD_CORE(corenumber), which is macro'ed to vYieldCore(corenumber) in portmacro.h (RP2040 port). The implementation is below, but the key pieces are:

  1. It will only yield a core other than the one it is currently running on.
  2. It writes to a HW FIFO used for inter-core communications. That write triggers an interrupt on the other core, which is used to cause that core to run the scheduler.
void vYieldCore( int xCoreID )
{
    configASSERT(xCoreID != portGET_CORE_ID());
    #if portRUNNING_ON_BOTH_CORES
        /* Non blocking, will cause interrupt on other core if the queue isn't already full,
        in which case an IRQ must be pending */
        sio_hw->fifo_wr = 0;
    #endif
}
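On an R5 platform without the RP2040's SIO FIFO, the analogue would be to raise whatever inter-core software interrupt the device provides. A hedged sketch, with prvTriggerInterCoreInterrupt() standing in for the platform-specific mechanism:

/* Sketch only: prvTriggerInterCoreInterrupt() is a hypothetical helper that
 * raises a software-generated interrupt on the target core. */
void vYieldCore( int xCoreID )
{
    /* Never yield the core this code is already running on. */
    configASSERT( xCoreID != portGET_CORE_ID() );

    prvTriggerInterCoreInterrupt( xCoreID );
}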

OK. So instead of a tick interrupt, the master core sends a software interrupt to the slave core, right?

Another question, about cache coherency.
We have 2 R5 cores and the caches are not coherent. How would you suggest ensuring RTOS memory coherency? Are memory barriers enough?

Yes. The master core takes the tick and issues a soft interrupt to the remaining cores. The XMOS port is a good example of going beyond 2 cores.

Any I-Cache will be fine. D-Cache could be used but you will need to take special care.

  1. Lock the affinity of tasks that will use the data cache so they stay on a single CPU.
  2. Ensure that cached data is not shared between tasks that are not affined to a single CPU.

If the cache were flushed on every context switch you could use it as you would in a single-core system, but because context switches can happen asynchronously, tasks that share data can easily run into issues with the cache.

For the interested observers the ARM technical documentation is here: Documentation – Arm Developer

Some thought experiments:

  1. Task X moves from core A to core B.
    — The context save should flush the cache and this should be fine.
  2. Task X shares data with Task Y. X is running in A and Y is running in B.
    — The shared data must be protected by a mutex, and the FreeRTOS mutex itself would need to use non-cached RAM (see the sketch after this list).
  3. FreeRTOS operations are running in both cores.
    — Critical memory is protected with HW spinlocks.
    — The critical memory must either be non-cached… or,
    — The critical sections must flush the cache on release and invalidate the cache on get. This is not demonstrated in the current ports.
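For thought experiment 2, one possible approach (a sketch, not from any existing port) is to allocate the mutex's storage statically and place it in a non-cached memory region. The ".noncached" section name is an assumption and must be mapped as non-cacheable in the linker script / MPU setup; configSUPPORT_STATIC_ALLOCATION must be set to 1.

#include "FreeRTOS.h"
#include "semphr.h"

/* Storage for the mutex, placed in an assumed non-cached section. */
static StaticSemaphore_t xMutexStorage __attribute__( ( section( ".noncached" ) ) );
static SemaphoreHandle_t xSharedDataMutex;

void vCreateSharedDataMutex( void )
{
    xSharedDataMutex = xSemaphoreCreateMutexStatic( &xMutexStorage );
    configASSERT( xSharedDataMutex != NULL );
}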

Posting a link to this page: Officially supported and contributed FreeRTOS code - some definitions, which shows the path for contributing your port once it's running.

Once we get acceptable results and clean code, we will upload it to GitHub.

Hi Joseph,

I’m able to trigger an interrupt on core1 from void vYieldCore( int xCoreID ) when executed on core0.
When you say “cause the core to run the scheduler”, which function exactly must be invoked in the ISR executed on core1?

Thanks,

Nir.

Going over RP2040/port.c: void prvFIFOInterruptHandler() again,

My guess is that
portYIELD_FROM_ISR(pdTRUE);
should be invoked in the ISR. Is that correct?

Another question. xPortStartSchedulerOnCore() is invoked on both core0 and core1, which suggests that the inter-core interrupt is registered on both cores. If so, core1 will also trigger an interrupt on core0 when vYieldCore( int xCoreID ) is invoked.

Is that correct?

Thanks a lot,

Nir.

Yes, you will call portYIELD_FROM_ISR(pdTRUE). You can see that behavior here: FreeRTOS-Kernel/port.c at 4832377117b4198db43009f2b548497d9cdbf8da · FreeRTOS/FreeRTOS-Kernel · GitHub

Your second question is exactly correct. In the pico port, the FIFO interrupt is configured for both directions between the cores, and either core can trigger portYIELD_FROM_ISR(pdTRUE) for the opposite core. This allows core B to reschedule core A for reasons other than the tick, e.g. a write to a queue can cause a higher priority task to unblock.
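For the R5, a minimal sketch of such an inter-core interrupt handler, modelled on the RP2040's prvFIFOInterruptHandler() (the acknowledge step is platform specific and only hinted at here):

/* Sketch: inter-core "yield" interrupt handler, installed on each core. */
static void prvInterCoreInterruptHandler( void )
{
    /* Acknowledge/clear the platform-specific inter-core interrupt source here. */

    /* Ask the scheduler to re-evaluate what this core should be running. */
    portYIELD_FROM_ISR( pdTRUE );
}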

Hi Joseph,

With regard to locks you previously answered:

“The kernel data is protected by two spinlocks: the TASK lock and the ISR lock. You will need to ensure you have implemented portGET_TASK_LOCK(), portRELEASE_TASK_LOCK(), portGET_ISR_LOCK() and portRELEASE_ISR_LOCK(). These need to be implemented in such a way that NO other core can get past the lock while it is held. On the RP2040 these are hardware spinlocks, so when the second core reaches the GET it will wait until the other core executes the RELEASE.”

Are these two locks supposed to be recursive when lock() is invoked repeatedly by the same core? Or by the same task?

I’m asking this because I noticed that in the SMP port for the RP2040 the define portENTER_CRITICAL() is changed from vPortEnterCritical() to vTaskEnterCritical(), and when I implement that for the R5 I get a deadlock when prvYieldForTask() is invoked from inside a critical section.

Thanks,

Nir.

The locks are not the same as portENTER_CRITICAL() and are only used to protect the scheduler. Since the scheduler can be entered by either core, both from an ISR and from “mainline” code, the task locks are required. These should not be recursive and the scheduler does not call them more than once before releasing them.

I will have to look more closely at the RP2040 version.

writing on behalf of @Nir
Hi Joseph,
We’ve achieved some progress.
We’re able to run tasks on both cores. For now we have disabled the data cache and will deal with it later.

Following the RP2040 example I’ve implemented recursive spinlocks.

I created two very simple tasks. The first task gives a semaphore, and the second task takes the semaphore.
I set the affinity of the first task to core0 and of the second task to core1.

If the second task was waiting for the semaphore and the core had made it into the minimal idle task, the first task always ran into an abort handler somewhere inside xSemaphoreGive().
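For reference, a hedged reconstruction of a test along those lines (task names, priorities, and timing are assumptions; requires configUSE_CORE_AFFINITY == 1):

#include "FreeRTOS.h"
#include "task.h"
#include "semphr.h"

static SemaphoreHandle_t xTestSemaphore;

static void prvGiverTask( void * pvParameters )
{
    ( void ) pvParameters;
    for( ; ; )
    {
        xSemaphoreGive( xTestSemaphore );
        vTaskDelay( pdMS_TO_TICKS( 10 ) );
    }
}

static void prvTakerTask( void * pvParameters )
{
    ( void ) pvParameters;
    for( ; ; )
    {
        xSemaphoreTake( xTestSemaphore, portMAX_DELAY );
    }
}

void vStartTwoCoreSemaphoreTest( void )
{
    TaskHandle_t xGiver = NULL, xTaker = NULL;

    xTestSemaphore = xSemaphoreCreateBinary();

    xTaskCreate( prvGiverTask, "Give", configMINIMAL_STACK_SIZE, NULL, tskIDLE_PRIORITY + 1, &xGiver );
    xTaskCreate( prvTakerTask, "Take", configMINIMAL_STACK_SIZE, NULL, tskIDLE_PRIORITY + 1, &xTaker );

    /* Pin the giver to core 0 and the taker to core 1. */
    vTaskCoreAffinitySet( xGiver, ( 1u << 0 ) );
    vTaskCoreAffinitySet( xTaker, ( 1u << 1 ) );
}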

Investigating the RP2040 port I found that configUSE_TIME_SLICING is defined to 0, whereas in my project it is defined to 1.
Changing configUSE_TIME_SLICING from 1 to 0 in my project changed things dramatically and now both tasks run smoothly.

I see that this define is used only in the context of xTaskIncrementTick().
I will definitely need time slicing in my project.

Did you try enabling time slicing with the RP2040?

Thanks a lot,

Nir.

@jjulich
Hi,
Should time slicing (configUSE_TIME_SLICING) work in the SMP port?

Best regards
Rasty

Yes, that will work normally, i.e. tasks at the same priority level will fairly share the CPUs and switch on the tick.

Hi Joseph @jjulich ,

1. For the implementation of portGET_TASK_LOCK and portGET_ISR_LOCK, is a recursive lock mandatory, or is a normal spinlock/mutex also fine?
The document “FreeRTOSMchangedescription.pdf” has this text:
“portGET_TASK_LOCK()
This must acquire a spinlock. The lock implementation must be recursive. If it is acquired N times by a core, it must then be released N times before another core is able to acquire it.”
2. Are there any test cases for SMP support in FreeRTOS? The basic tests for normal FreeRTOS work, but I want to know whether there are any test cases for the SMP features in particular.

  1. The document is correct. We have a goal to minimize recursive locks, but my earlier statement was premature (a hedged sketch of such a recursive lock follows below).
  2. There are currently NO specific tests for SMP. That is on the list of things to do and we would welcome any submissions in this area.
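For reference, here is a minimal sketch of recursive task/ISR locks for a two-core R5 port. It is an assumption-laden illustration, not code from any shipping port: it relies on GCC __atomic builtins (which compile to LDREX/STREX on ARMv7-R), assumes portGET_CORE_ID() returns 0 or 1, and assumes the lock variables live in memory both cores see coherently (e.g. D-cache disabled or a non-cached region). In a real port the macros would live in portmacro.h and call functions exported from port.c.

#include <stdint.h>
#include "FreeRTOS.h"

typedef struct
{
    volatile uint32_t ulOwner; /* owning core + 1, 0 means unlocked */
    volatile uint32_t ulCount; /* recursion depth */
} PortSpinlock_t;

static PortSpinlock_t xTaskLock = { 0, 0 };
static PortSpinlock_t xIsrLock = { 0, 0 };

static void prvRecursiveSpinLock( PortSpinlock_t * pxLock )
{
    uint32_t ulMe = ( uint32_t ) portGET_CORE_ID() + 1U;

    if( __atomic_load_n( &( pxLock->ulOwner ), __ATOMIC_ACQUIRE ) == ulMe )
    {
        /* Already held by this core - just deepen the recursion. */
        pxLock->ulCount++;
        return;
    }

    for( ; ; )
    {
        uint32_t ulExpected = 0U;

        if( __atomic_compare_exchange_n( &( pxLock->ulOwner ), &ulExpected, ulMe,
                                         0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED ) != 0 )
        {
            pxLock->ulCount = 1U;
            return;
        }
        /* Busy-wait; a WFE/SEV pairing could reduce power here. */
    }
}

static void prvRecursiveSpinUnlock( PortSpinlock_t * pxLock )
{
    if( --( pxLock->ulCount ) == 0U )
    {
        __atomic_store_n( &( pxLock->ulOwner ), 0U, __ATOMIC_RELEASE );
    }
}

#define portGET_TASK_LOCK()        prvRecursiveSpinLock( &xTaskLock )
#define portRELEASE_TASK_LOCK()    prvRecursiveSpinUnlock( &xTaskLock )
#define portGET_ISR_LOCK()         prvRecursiveSpinLock( &xIsrLock )
#define portRELEASE_ISR_LOCK()     prvRecursiveSpinUnlock( &xIsrLock )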