SMP Performance Benchmarking

Hi folks,

I’m working on integrating an SMP port for a device in the NXP S32K3 family with 3x Cortex-M7 cores running at 320 MHz. This chip has per-core ITCM, DTCM, I-cache, and D-cache, but no hardware cache coherency. From my background research and other topics on this forum, this would appear to be a poor candidate for SMP. However, I’m looking to operate the system in a Bound Multiprocessing (BMP) approach in which every task has affinity to only one core. Here is my understanding of where the code can be placed with this configuration:

  1. Since each task can only run on one core, its code and data can be placed in that core’s local ITCM/DTCM, which is the fastest memory available on the device (a rough placement sketch follows this list).
  2. The FreeRTOS kernel code executes from Flash, but each core can keep its I-cache enabled because the .text section is read-only. This should be the same as running in a single-core or AMP configuration.
  3. The FreeRTOS kernel data (.bss and .data sections) along with the task TCB buffers are placed in a shared RAM section that has D-cache disabled to avoid data synchronization issues between cores. This will be slower than a normal single-core or AMP configuration.
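
For point 1, this is roughly the pattern I have in mind for placing a task’s stack in a core’s DTCM and pinning the task to that core. It’s only a sketch: the section name and the create/affinity calls are what I’d expect to use, not code I’ve validated on this part.

    /* Sketch only: ".core0_dtcm_bss" is a placeholder for whatever section the
     * linker script maps to core 0's DTCM; requires configSUPPORT_STATIC_ALLOCATION
     * and configUSE_CORE_AFFINITY. */
    #include "FreeRTOS.h"
    #include "task.h"

    #define ctrlSTACK_DEPTH    512U

    static StackType_t uxCtrlStack[ ctrlSTACK_DEPTH ] __attribute__( ( section( ".core0_dtcm_bss" ) ) );
    static StaticTask_t xCtrlTCB;    /* Per point 3, the TCB buffer would sit in the shared (non-cached) RAM section. */

    static void prvControlTask( void * pvParameters )
    {
        ( void ) pvParameters;

        for( ;; )
        {
            /* ...task work... */
        }
    }

    void vCreateControlTask( void )
    {
        TaskHandle_t xHandle;

        xHandle = xTaskCreateStatic( prvControlTask, "ctrl", ctrlSTACK_DEPTH, NULL,
                                     configMAX_PRIORITIES - 1, uxCtrlStack, &xCtrlTCB );

        /* Pin the task to core 0 so its stack in core 0's DTCM is never touched by another core. */
        vTaskCoreAffinitySet( xHandle, ( UBaseType_t ) ( 1U << 0 ) );
    }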

To test this, I’ve mostly ported my existing project running in single-core mode to use an SMP port of FreeRTOS v11.1 provided by NXP for this family of MCUs. However, the impact on performance due to the additional multi-core overhead has been pretty shocking. My project has an interrupt that is triggered every 50 μs and unblocks a FreeRTOS task to handle the processing using a task notification. This works great in single-core mode, with acceptable overhead from the context switching in the kernel:

Bare ISR (no task notification) - 10-13 μs total
ISR w/deferred interrupt handler (via task notification) - 16.5-17.5 μs total

This represents an overhead of approx. 4.5-6.5 μs, which is not bad. However, in SMP mode with both other cores running their idle tasks - i.e., the main core runs the same number of tasks and the same ISR as before:

Bare ISR (no task notification) - 10-13 μs total
ISR w/deferred interrupt handler (via task notification) - 53-57 μs total

This is longer than the period of the interrupt and is therefore unusable. Here’s an example of the average execution time between a couple of key functions in single-core vs. SMP mode (measured with ETM trace):

vTaskSwitchContext: 131 ns → 5.2 μs
PendSVHandler: 176 ns → 5.9 μs
ulTaskGenericNotifyTake: 4 μs → 33 μs !

The SMP port provided by the vendor also defines the portENTER_CRITICAL() and portEXIT_CRITICAL() macros to use vTaskEnterCritical()/vTaskExitCritical() instead of the simpler vPortEnterCritical()/vPortExitCritical() functions used in single-core mode. These also take almost two orders of magnitude longer than in the single-core case. Part of that time goes to the many calls to portGET_CORE_ID() and to waiting for the ISR/task spinlocks to be acquired (sometimes up to 1.5 μs).
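
To make the difference concrete, here’s a simplified stand-alone model of the two enter paths. This is just my mental model of the cost, not the vendor port’s actual code; the extern core-ID and interrupt-mask helpers and the atomic_flag spins are stand-ins for portGET_CORE_ID(), portDISABLE_INTERRUPTS() and portGET_TASK_LOCK()/portGET_ISR_LOCK().

    #include <stdatomic.h>
    #include <stdint.h>

    /* Stand-ins for port macros - assumptions, not the real port. */
    extern uint32_t ulModelGetCoreId( void );       /* ~200-600 ns off-core register read on this device */
    extern void vModelMaskInterrupts( void );       /* portDISABLE_INTERRUPTS() equivalent */

    /* Single-core enter: mask interrupts and bump a nesting counter. */
    static volatile uint32_t ulNesting;

    static void vModelEnterCriticalSingleCore( void )
    {
        vModelMaskInterrupts();
        ulNesting++;
    }

    /* SMP enter: read the core ID, mask interrupts, then spin on BOTH the task
     * and ISR locks (which another core may be holding) before bumping the
     * per-core nesting count.  Every extra step adds latency, and the spins
     * stall for as long as the other core stays in its critical section. */
    static atomic_flag xTaskLock = ATOMIC_FLAG_INIT;
    static atomic_flag xIsrLock = ATOMIC_FLAG_INIT;
    static volatile uint32_t ulNestingPerCore[ 3 ];

    static void vModelEnterCriticalSmp( void )
    {
        uint32_t ulCoreId = ulModelGetCoreId();

        vModelMaskInterrupts();

        while( atomic_flag_test_and_set_explicit( &xTaskLock, memory_order_acquire ) ) { }
        while( atomic_flag_test_and_set_explicit( &xIsrLock, memory_order_acquire ) ) { }

        ulNestingPerCore[ ulCoreId ]++;
    }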

I’m trying to determine whether my fundamental understanding of the limitations of SMP is incomplete and there is a reason why the performance is so limited on this particular hardware, or whether something is wrong with the vendor-supplied SMP port (to which I had to make several modifications) and the performance impact can be mitigated to some extent.

Does anyone have rough (order of magnitude) timing data comparing a single-core vs. SMP port on their particular device?

Thanks in advance!

Edit: To clarify, I’m mostly curious about A/B tests of single-core vs. SMP with the same FreeRTOS configuration. I currently have things like configASSERT() and stack overflow checking enabled, but these apply to both ports equally.

The only exception is that the SMP mode doesn’t support configUSE_PORT_OPTIMISED_TASK_SELECTION, so I had it disabled in my SMP tests but kept it enabled in single-core mode. I don’t think this would have much of an impact in this particular case, since the task notification directly yields into the task (the kernel doesn’t have to select which task to run).

First, a task notification does not directly yield to the task; it just marks the task as ready and tells you that it was higher in priority than the currently running task. The scheduler still needs to find it.

Second, SMP does have overhead, which can be significant, especially when you need to run the kernel in memory not optimized for program execution.

Since all your tasks are core-locked, you are not going to get many of the advantages of SMP, but you will get all the disadvantages. For this sort of configuration, I suspect that having each core run its own AMP instance of FreeRTOS, with core-to-core interrupt signalling used for the notifications between cores, would work better.

1 Like

Thanks for the quick feedback!

I’m currently following the task notification with a call to portYIELD_FROM_ISR(), which I believe directly runs the task after the ISR returns.
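
Concretely, the ISR and handler task look something like this (simplified, with placeholder names rather than my actual code):

    #include "FreeRTOS.h"
    #include "task.h"

    static TaskHandle_t xHandlerTask;

    /* The 50 μs timer interrupt. */
    void TIMER_IRQHandler( void )
    {
        BaseType_t xHigherPriorityTaskWoken = pdFALSE;

        /* Clear the peripheral interrupt flag here (device specific). */

        vTaskNotifyGiveFromISR( xHandlerTask, &xHigherPriorityTaskWoken );

        /* Request a context switch on exit if the handler task was woken. */
        portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
    }

    /* Deferred interrupt handler task. */
    static void prvHandlerTask( void * pvParameters )
    {
        ( void ) pvParameters;

        for( ;; )
        {
            /* Block until the ISR gives the notification. */
            ulTaskNotifyTake( pdTRUE, portMAX_DELAY );

            /* ...do the time-critical processing here... */
        }
    }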

One of the main motivations for using SMP in my application is that I’m using an AUTOSAR-compliant library for peripheral configuration (MCAL) which is designed to manage resources under the assumption that it’s operating in an SMP-style configuration (AUTOSAR Classic). If FreeRTOS SMP can deliver sufficient performance, it would make the project far simpler than having to manage what are effectively 3 separate single-core projects, each with its own instance of FreeRTOS, the MCAL, separate linker files + binaries, etc. The ability to use FreeRTOS inter-task communication primitives would also help a lot in that regard.

For an M7, the ISR will return, then the PendSV handler will chain in and run the scheduler on that core; when that returns, it returns to the task that has been activated.

The task notification call does not itself do the work of changing which task you will return to; it just moves that task to the ready list so that the scheduler will run it. If it was higher in priority than the currently running task on that core, it will be at the front of the list, but the code still just puts it on the list and records that priority as the current highest priority.

All that code to do the notification and the rescheduling will, by necessity, be working in the slower shared memory, which illustrates the cost of “shared memory” for that task; the data accesses in particular, which cannot use the cache, will be slow.

You are the only one who can judge whether the slowdown makes it worth doing it that way, but the impression I get is that ARM doesn’t really consider the Cortex-M series to be designed for SMP; they are AMP processors (which is why you get all the processor-specific capabilities like TCM).

Note, FreeRTOS does support the use of Stream and Message Buffers as inter-core communication primitives.
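
(For reference, the core-to-core pattern is roughly the following. This is only an outline: the buffer has to live in memory both cores can see, and vGenerateCoreToCoreInterrupt()/CORE_TO_CORE_IRQHandler() are placeholders for whatever inter-core interrupt the device provides.)

    #include <stdint.h>
    #include "FreeRTOS.h"
    #include "stream_buffer.h"

    /* Created in memory visible to both cores; how the handle is shared is application specific. */
    extern StreamBufferHandle_t xSharedBuffer;

    /* In the sending core's FreeRTOSConfig.h: override the completion hook so it
     * pokes the receiving core instead of the local scheduler. */
    #define sbSEND_COMPLETED( pxStreamBuffer )    vGenerateCoreToCoreInterrupt()
    extern void vGenerateCoreToCoreInterrupt( void );

    /* Sending core. */
    void vSendToOtherCore( const uint8_t * pucData, size_t xLength )
    {
        xStreamBufferSend( xSharedBuffer, pucData, xLength, portMAX_DELAY );
    }

    /* Receiving core: the inter-core interrupt unblocks the task that is waiting
     * in xStreamBufferReceive(). */
    void CORE_TO_CORE_IRQHandler( void )
    {
        BaseType_t xHigherPriorityTaskWoken = pdFALSE;

        xStreamBufferSendCompletedFromISR( xSharedBuffer, &xHigherPriorityTaskWoken );
        portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
    }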

1 Like

Thanks Richard, your explanation makes sense to me. It seems like my two options are to continue with the SMP port and execute this particular ISR purely in an interrupt context (outside FreeRTOS) or to move to an AMP configuration.

Hi fvo,

Thank you for sharing this information. Although I don’t have the timing data, I would like to break down the call sequence in your application and discuss with you whether it’s possible to improve the performance in SMP (Symmetric Multiprocessing).

I have broken down the call sequence and listed only the implementations that differ between single-core and SMP, based on my understanding of your question. This assumes the system has only one task for the deferred interrupt handler and that interrupts occur every 50 μs. If my understanding or assumptions are incorrect, please help me correct them.

The following step takes 10-13 μs. This is common to single-core and SMP.

  • The ISR fires and calls the ISR handler

The time taken for the following steps differs: 4.5-6.5 μs (single core) versus 43-44 μs (SMP). Only the functions with different implementations are listed.

  • The ISR handler calls xTaskGenericNotifyFromISR to wake up the deferred-ISR task
    • taskENTER_CRITICAL_FROM_ISR();
    • prvYieldForTask( pxTCB ) : the scheduler puts the task back on the ready list and requests a core to yield for it.
    • taskEXIT_CRITICAL_FROM_ISR( uxSavedInterruptStatus );
  • PendSVHandler chains in to switch in the task
    • vTaskSwitchContext( portGET_CORE_ID() )
      • prvSelectHighestPriorityTask()
  • The task leaves ulTaskGenericNotifyTake
    • taskENTER_CRITICAL();
    • taskEXIT_CRITICAL();
  • The task performs the deferred ISR work
  • The task calls ulTaskGenericNotifyTake again to wait for the next ISR
    • vTaskSuspendAll();
    • taskENTER_CRITICAL();
    • taskEXIT_CRITICAL();
    • taskENTER_CRITICAL();
    • taskEXIT_CRITICAL();
    • xAlreadyYielded = xTaskResumeAll();
    • taskYIELD_WITHIN_API();
  • PendSVHandler chains in to switch in the idle task
    • vTaskSwitchContext( portGET_CORE_ID() )
      • prvSelectHighestPriorityTask()

Some observations from the breakdown:

  • Cortex-M7 supports the LDREX/STREX instructions, so portGET/RELEASE_TASK/ISR_LOCK() can be implemented using them. If spinlock acquisition can take up to 1.5 μs, it may be that another core is running inside the critical section at that time. If we can obtain the time spent in the portGET/RELEASE_TASK/ISR_LOCK() macros themselves (a rough way to measure this is sketched after this list), we will gain better insight into how the time is split across the following functions:
    • taskENTER_CRITICAL_FROM_ISR();
    • taskEXIT_CRITICAL_FROM_ISR( uxSavedInterruptStatus );
    • vTaskSwitchContext( portGET_CORE_ID() );
    • taskENTER_CRITICAL();
    • taskEXIT_CRITICAL();
  • In RP2040, portGET_CORE_ID() is implemented by reading the SIO_CPUID register, so it may not be a performance bottleneck there. If we know the time spent in portGET_CORE_ID() on your platform, we’ll have a clearer direction for performance optimization.
  • vTaskSwitchContext() calls prvSelectHighestPriorityTask() in the SMP implementation. The following kernel configuration options also affect the time required to select a task:
    • configRUN_MULTIPLE_PRIORITIES : When set to 0, the scheduler guarantees that lower-priority tasks never run at the same time as a higher-priority task, which adds work to task selection. If your application does not rely on that guarantee, setting it to 1 simplifies the scheduler logic.
    • configUSE_CORE_AFFINITY : Core affinity also affects task selection. Consider using priority instead of core affinity if that also meets your application requirements.
    • configUSE_TASK_PREEMPTION_DISABLE : A task with preemption disabled can’t be selected to yield for a higher-priority task in prvYieldForTask( pxTCB ). Suggest disabling this feature if your application doesn’t require it.
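
One rough way to obtain those numbers on a Cortex-M7 is to wrap the macros with the DWT cycle counter. The sketch below uses the standard ARMv7-M debug registers; the timedGET_TASK_LOCK() wrapper name is only for illustration, and portGET_TASK_LOCK() is assumed to come from the port layer.

    #include <stdint.h>

    /* Standard ARMv7-M DWT/DEMCR registers. */
    #define DWT_CTRL      ( *( volatile uint32_t * ) 0xE0001000UL )
    #define DWT_CYCCNT    ( *( volatile uint32_t * ) 0xE0001004UL )
    #define DEMCR         ( *( volatile uint32_t * ) 0xE000EDFCUL )

    static inline void vTimingInit( void )
    {
        DEMCR |= ( 1UL << 24 );    /* TRCENA: enable the DWT unit. */
        DWT_CYCCNT = 0UL;
        DWT_CTRL |= 1UL;           /* CYCCNTENA: start the cycle counter. */
    }

    /* Record how long one portGET_TASK_LOCK() call (including any spin) takes. */
    volatile uint32_t ulLastTaskLockCycles;

    #define timedGET_TASK_LOCK()                             \
        do {                                                 \
            uint32_t ulStart = DWT_CYCCNT;                   \
            portGET_TASK_LOCK();                             \
            ulLastTaskLockCycles = DWT_CYCCNT - ulStart;     \
        } while( 0 )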

We would appreciate it if you could provide more detailed timing information from your application, as it would give us better insight into the performance issue. Thank you again for sharing this valuable information with us. Your feedback has given us a helpful direction for improving the FreeRTOS SMP features.

Hi @Fresh, I actually opened an issue on GitHub yesterday (#1204) discussing potential improvements to portGET_CORE_ID(), which seems to be one of the biggest areas for execution time improvement / simplest things to change. I added some timing information there to use as a point of reference.

I’ve done some further analysis since yesterday with some slightly more accurate numbers now, though there’s still quite a lot of variation due to caching etc. A couple of examples for comparison:

Function             | Mean execution time (single core) | Mean execution time (SMP)
vTaskSwitchContext() | 150 ns                            | 5.2 μs
PendSVHandler()      | 180 ns                            | 5.9 μs

Perhaps the biggest individual impact is the need to use the more complex vTaskEnterCritical()/vTaskExitCritical() instead of the default vPortEnterCritical()/vPortExitCritical() implementations used in single-core mode (using the CM7 r0p1 port as the baseline). The MCAL provided by our vendor (NXP) ends up using these FreeRTOS functions to handle critical sections during peripheral accesses, which adds further time to all the tasks and was included in the numbers in my initial post:

Purpose                | Single core (vPortEnter/ExitCritical()) | SMP (vTaskEnter/ExitCritical())
Enter critical section | 100-1,500 ns                            | 1,500-3,200 ns
Exit critical section  | 90-300 ns                               | 3,600-3,800 ns

From some early testing I was able to reduce the execution time of several FreeRTOS functions by ~30% on my system by:

  1. Calling portGET_CORE_ID() only once per function call and storing the value locally (see the sketch after this list).
  2. Passing the core ID to the CRITICAL_NESTING_COUNT macros as a parameter.
  3. Passing the core ID to the TASK_LOCK and ISR_LOCK macros.
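
As a simplified illustration of the pattern (not the actual kernel code; the per-core arrays here are hypothetical bookkeeping, just to show the shape of the change):

    #include "FreeRTOS.h"
    #include "task.h"

    /* Hypothetical per-core bookkeeping, only to illustrate the pattern. */
    static volatile UBaseType_t uxNesting[ configNUMBER_OF_CORES ];
    static volatile BaseType_t xLockOwner[ configNUMBER_OF_CORES ];

    /* Before: every macro expansion re-reads the core ID register
     * (200-600 ns per read on this device). */
    static void vEnterBefore( void )
    {
        uxNesting[ portGET_CORE_ID() ]++;                     /* read #1 */
        xLockOwner[ portGET_CORE_ID() ] = portGET_CORE_ID();  /* reads #2 and #3 */
    }

    /* After: read the core ID once and pass it down. */
    static void vEnterAfter( void )
    {
        const BaseType_t xCoreID = ( BaseType_t ) portGET_CORE_ID();

        uxNesting[ xCoreID ]++;
        xLockOwner[ xCoreID ] = xCoreID;
    }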

I summarised the number of unnecessary calls to portGET_CORE_ID() in each FreeRTOS function in my issue, for which I’m happy to also submit some PRs. Each call to portGET_CORE_ID() takes 200-600 ns on my system because it must read from an off-core peripheral register.

The function implementations used by the TASK_LOCK and ISR_LOCK macros do use the LDREX/STREX instructions. The average execution time isn’t too bad (on the order of 200-300 ns), but the longer spinlock wait periods (1.5 μs or so) happen quite frequently, I guess because the schedulers on two cores are trying to switch tasks at similar tick intervals.

In my case, configRUN_MULTIPLE_PRIORITIES = 1, core affinity is a hard requirement (because peripherals are mapped to a specific core), and configUSE_TASK_PREEMPTION_DISABLE = 0.

5 Likes

Hello,

Starting from this statement:

By comparing the single-core version of vTaskSwitchContext() with the SMP version, the lock/unlock functions are what stand out as the main difference:

    portGET_TASK_LOCK();    /* Must always acquire the task lock first. */
    portGET_ISR_LOCK();
    /* ... */
    portRELEASE_ISR_LOCK();
    portRELEASE_TASK_LOCK();

I think the problem here is that all critical regions started from all tasks on all cores compete for the same lock: TASK_LOCK.
Meaning, if a task on Core1/2 wants a critical section, the Core0 scheduler will wait for the same lock to be free, even if the resources are unrelated.

Why not treat the scheduler as a resource that is independent from all the other resources the user can create?

All the critical regions the MCAL creates map onto only 2 locks, TASK_LOCK and ISR_LOCK, depending on where they are called from.

This leads to the question of why we do not have the option of using independent locks for independent resources, e.g. a CAN driver critical section on Core0 should be unrelated to an ETH critical section on Core1/2.

Regards,
George

I think the issue is that the TASK_LOCK and the ISR_LOCK are designed for FreeRTOS’s own use; if you want something independent for your own purposes, make your own lock rather than using the “global” ones. They are designed to protect the FreeRTOS scheduler’s information. It is a bit like using a mutex instead of a system-wide critical section when you only need to keep some tasks from competing for your resource. If the issue is that some external library isn’t doing it right, that isn’t really a FreeRTOS issue.
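
(For example, a minimal per-resource lock could look something like the sketch below. This is not an existing FreeRTOS API, just C11 atomics; the usual spinlock rules still apply: keep the held region very short, and on a part like this the flag needs to live in memory all cores see coherently.)

    #include <stdatomic.h>

    typedef struct
    {
        atomic_flag xFlag;
    } ResourceLock_t;

    #define RESOURCE_LOCK_INIT    { ATOMIC_FLAG_INIT }

    static inline void vResourceLockAcquire( ResourceLock_t * pxLock )
    {
        while( atomic_flag_test_and_set_explicit( &( pxLock->xFlag ), memory_order_acquire ) )
        {
            /* Spin.  Typically you would also mask interrupts (or disable
             * preemption) around the held region so the owner cannot be
             * switched out while other cores are spinning. */
        }
    }

    static inline void vResourceLockRelease( ResourceLock_t * pxLock )
    {
        atomic_flag_clear_explicit( &( pxLock->xFlag ), memory_order_release );
    }

    /* One lock per independent resource, e.g. one for CAN and one for ETH,
     * so unrelated peripherals never contend with each other or with the kernel. */
    static ResourceLock_t xCanLock = RESOURCE_LOCK_INIT;
    static ResourceLock_t xEthLock = RESOURCE_LOCK_INIT;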

I think the expectation was that FreeRTOS would provide such a feature rather than each user inventing their own SMP mutex. The user doesn’t know the internals, so they use what is available and “seems right to use”, with not-so-good results, as it seems.

Regards,
George

1 Like

The base part of FreeRTOS doesn’t know the hardware limitations, i.e. how many of these SMP mutexes are available. There is also the question of whether those operations actually NEED this sort of exclusion, or whether a “regular mutex” is good enough.

Regarding the hardware limitations, I think this needs to live on the port side, with the user able to configure what is available and how many of each.
The generic API should be present, and the port-specific code can handle any hardware specifics and incorrect use coming from the generic layers. Or, if the hardware doesn’t support the feature, disable the generic API; but in this specific case, if there is no hardware support for spinlocks, there is no SMP at all.
Both single-core and SMP mutexes are needed, and the application is responsible for using the correct one in each case.
In the example above, I think the two internal SMP mutexes are used for everything (the scheduler and “library”/application resources), even though that makes no sense, which leads to the poor performance.
Having dedicated APIs will help fix the problem.
The user API can have a parameter indicating the ID of the resource it’s operating on, thus removing interference between unrelated resources.

thx,
George

I suppose my question is: what do you need at the application level from an “SMP mutex” that the ordinary mutex doesn’t handle? (It WILL block a task on a core that is different from the core that took the mutex first; there is no such thing as a “single-core” mutex in FreeRTOS.) At the kernel level there is a need, because the kernel can’t “block”, since it is what does the blocking, so it needs the spinlock level of interlocking, and that comes with the presumption that everyone using those locks follows the internal rules for such locks (like only holding them for short, deterministic periods). “I/O” libraries rarely need the sort of global critical section that the kernel needs for the scheduler, and from what I have seen, people will sometimes grab the “critical section” from FreeRTOS just because it matches the name of a more general concept, without understanding what it actually is.
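
(As a sketch of what I mean, not MCAL code: an ordinary mutex scoped to one peripheral only ever blocks the tasks that actually touch that peripheral, and never the scheduler on the other cores.)

    #include <stddef.h>
    #include <stdint.h>
    #include "FreeRTOS.h"
    #include "semphr.h"

    static SemaphoreHandle_t xCanMutex;

    void vCanDriverInit( void )
    {
        xCanMutex = xSemaphoreCreateMutex();
    }

    void vCanDriverWrite( const uint8_t * pucFrame, size_t xLength )
    {
        ( void ) pucFrame;
        ( void ) xLength;

        if( xSemaphoreTake( xCanMutex, portMAX_DELAY ) == pdTRUE )
        {
            /* ...program the CAN peripheral... */
            xSemaphoreGive( xCanMutex );
        }
    }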

With AMP, the code for the FreeRTOS critical section is quick and can be useful for code that needs very low-overhead protection around a very short sequence of operations, so there are reasonable uses in “user” code when that is what is needed. When you get into SMP mode, the cost goes up, and the fact that it becomes a GLOBAL block becomes meaningful, so it is much less appropriate for user operations.

I will admit that I haven’t done a lot of SMP work, and it is a fact that much of the FreeRTOS ecosystem was built in an AMP-type world. But I also know that the guidelines limit the use of the “critical section” to very short operations that need to be atomic, so the extra cost in SMP shouldn’t be that bad where it is used correctly, and things might improve as FreeRTOS gets more use in SMP environments and more work is done on it.

You need to remember that FreeRTOS is designed to be very generic, and usable in many environments, and is optimized for the smaller cases, not to be the most powerful for the larger case.

1 Like

This looks like a very thorough and eye-opening analysis, thank you for that!

Nevertheless, as mind-boggling as those figures appear, please do not forget that even with the “by multitude worse” OS overhead on SMP over single core that you measured, SMP may still be the better choice!

The key here is that as long as the computations that you distribute over the cores are all CPU bound, they can execute truly in parallel and thus as a whole all execute significantly faster than on a single core, so may easily even out the switching overhead (which of course does not mean you should not look at optimization potential).

If, on the other hand, your threads are mostly I/O bound or, worse, practically so strictly serialized against each other that they do not benefit from concurrency at all, then you might not need an RTOS in the first place, let alone multiple cores.

So all the figures you present should also be looked at in context.

1 Like

There’s an issue and recent PR on GitHub to introduce granular locks in SMP mode, but I’m not sure when it’s expected to be ready. This would certainly help reduce lock contention between cores.

It would be nice to have the ability to define separate spinlocks for each MCAL critical section using FreeRTOS primitives, to save the application layer from having to define them itself.

The key here is that as long as the computations that you distribute over the cores are all CPU bound, they can execute truly in parallel and thus as a whole all execute significantly faster than on a single core, so may easily even out the switching overhead (which of course does not mean you should not look at optimization potential).
If, on the other hand, your threads are mostly I/O bound or, worse, practically so strictly serialized against each other that they do not benefit from concurrency at all, then you might not need an RTOS in the first place, let alone multiple cores.

All very good points!

My particular application has a mix of I/O-bound tasks (CAN, SPI, Ethernet) and partially CPU-bound tasks with some serialization constraints (high-speed, deterministic control loops whose inputs depend on some ADC readings). After further research, it’s looking increasingly clear that AMP is better suited to my application.

Nonetheless, I’m still interested in improving SMP performance in FreeRTOS, especially given that the optimizations I’ve proposed are simple to implement.

1 Like

As you rightly identified, granular locking for SMP is being implemented to reduce lock contention and thereby improve performance. Please take a look and see if that is what you are looking for.

Assuming that you are talking about reducing portGET_CORE_ID() calls, we have merged it - Pass core ID to critical nesting count macros by felixvanoost · Pull Request #1206 · FreeRTOS/FreeRTOS-Kernel · GitHub.

I noticed that this morning. Thanks for being so responsive! In my original issue (#1204) I also discussed passing the core ID to the spinlock functions, which would achieve a similar execution-time improvement. Every SMP port I’ve seen so far needs to get the core ID at some point in the task and ISR lock/release function implementations. This would require a breaking change to add the argument to the following macros (rough shape sketched after the list):

portGET_TASK_LOCK()
portGET_ISR_LOCK()
portRELEASE_TASK_LOCK()
portRELEASE_ISR_LOCK()
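
Roughly, the change would give them the shape below. The vPortTake/Release names here are just placeholders for whatever each port actually implements; the point is only that the already-known core ID gets passed down instead of being re-read inside the port layer.

    #define portGET_TASK_LOCK( xCoreID )        vPortTakeTaskSpinlock( xCoreID )
    #define portRELEASE_TASK_LOCK( xCoreID )    vPortReleaseTaskSpinlock( xCoreID )
    #define portGET_ISR_LOCK( xCoreID )         vPortTakeIsrSpinlock( xCoreID )
    #define portRELEASE_ISR_LOCK( xCoreID )     vPortReleaseIsrSpinlock( xCoreID )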

@aggarg is this a change the FreeRTOS team would consider? Feel free to comment here or directly on the issue.

@aggarg it seems that the proposed granular locking feature is what we need to reduce the unnecessary contention from applications.

regards,
George

The problem is that the current implementation offers only two spinlocks, TASK_LOCK and ISR_LOCK, and their purpose is internal use. Trying to use these in the user application will lead to the poor performance reported in the OP.

In SMP mode, where any MCAL API can be called concurrently from any core, critical hardware resources need to be protected by spinlocks. If we multiplex all the critical sections onto the two spinlocks TASK_LOCK and ISR_LOCK, we get the poor performance reported by the OP.

It is true that the user (myself included) will grab the FreeRTOS API that “looks right”, and only discover after digging into the code that it’s not right.

After looking at the granular locking proposal, I think it can be used to remove this limitation and avoid the performance penalty.

In the end, a spinlock API needs to be part of an SMP OS, rather than each user inventing their own.

regards,
George