Hi folks,
I’m working on integrating an SMP port for a device in the NXP S32K3 family with 3x Cortex-M7 cores running at 320 MHz. This chip has per-core ITCM, DTCM, I-cache, and D-cache, but no hardware cache coherency. From my background research and other topics on this forum, this would appear to be a poor candidate for SMP. However, I’m looking to operate the system in a Bound Multiprocessing (BMP) approach in which every task has affinity to only one core. Here is my understanding of where code and data can be placed with this configuration:
- Since each task can only run on one core, its code and data can be placed in the core’s local ITCM/DTCM. This is the fastest memory available on the device.
- The FreeRTOS kernel code executes from Flash, but each core can keep its I-cache enabled because .text is read-only. This should be the same as running in a single-core or AMP configuration.
- The FreeRTOS kernel data (the .bss and .data sections), along with the task TCB buffers, are placed in a shared RAM region with the D-cache disabled to avoid data synchronization issues between cores. This will be slower than a normal single-core or AMP configuration. (A sketch of this placement follows below.)
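Concretely, the placement looks something like the sketch below. This is simplified: the section names are placeholders for whatever the linker script actually defines, and it assumes `configSUPPORT_STATIC_ALLOCATION` and `configUSE_CORE_AFFINITY` are enabled in the SMP kernel.

```c
/* Simplified BMP placement sketch. The section names below are
 * placeholders -- they must match regions defined in the linker script. */
#include "FreeRTOS.h"
#include "task.h"

/* Hot task code in core 0's ITCM (fastest instruction memory). */
__attribute__( ( section( ".core0_itcm_text" ) ) )
static void prvBoundTask( void * pvParameters )
{
    ( void ) pvParameters;

    for( ;; )
    {
        /* ... core-0-only work ... */
    }
}

/* Task stack in core 0's DTCM: only core 0 ever runs or context-switches
 * this task, so no other core touches the stack and D-cache can stay on. */
__attribute__( ( section( ".core0_dtcm_data" ) ) )
static StackType_t uxBoundTaskStack[ 256 ];

/* TCB in non-cacheable shared RAM: other cores may read/write it, e.g.
 * when sending this task a notification. */
__attribute__( ( section( ".shared_nocache_bss" ) ) )
static StaticTask_t xBoundTaskTCB;

void vCreateBoundTask( void )
{
    TaskHandle_t xTask;

    xTask = xTaskCreateStatic( prvBoundTask,
                               "bound0",
                               sizeof( uxBoundTaskStack ) / sizeof( StackType_t ),
                               NULL,
                               tskIDLE_PRIORITY + 1,
                               uxBoundTaskStack,
                               &xBoundTaskTCB );

    /* Pin the task to core 0 only (BMP). */
    vTaskCoreAffinitySet( xTask, ( UBaseType_t ) ( 1u << 0 ) );
}
```

The stack can live in DTCM because only the owning core ever context-switches the task; the TCB goes in non-cacheable RAM because other cores may touch it when interacting with the task.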
To test this, I’ve ported my existing single-core project to the SMP port of FreeRTOS v11.1 that NXP provides for this MCU family. However, the performance impact of the additional multi-core overhead has been pretty shocking. My project has an interrupt that triggers every 50 μs and unblocks a FreeRTOS task, via a task notification, to handle the processing. This works great in single-core mode, with acceptable overhead from the kernel’s context switching:
- Bare ISR (no task notification): 10-13 μs total
- ISR w/ deferred interrupt handler (via task notification): 16.5-17.5 μs total
This represents an overhead of approx. 4.5-6.5 μs, which is not bad. However, in SMP mode, with the other two cores running only their idle tasks (i.e., the main core runs the same number of tasks and the same ISR as before):
- Bare ISR (no task notification): 10-13 μs total
- ISR w/ deferred interrupt handler (via task notification): 53-57 μs total
This is longer than the period of the interrupt and is therefore unusable. Here are the average execution times of a couple of key functions in single-core vs. SMP mode (measured with ETM trace):
- vTaskSwitchContext: 131 ns → 5.2 μs
- PendSVHandler: 176 ns → 5.9 μs
- ulTaskGenericNotifyTake: 4 μs → 33 μs (!)
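For reference, the pattern being timed is the standard deferred-interrupt-handling idiom, roughly like the sketch below (the ISR and task names here are made up):

```c
/* Minimal sketch of the measured pattern (names are made up). */
#include "FreeRTOS.h"
#include "task.h"

static TaskHandle_t xHandlerTask = NULL;

/* Fires every 50 us. */
void SAMPLE_IRQHandler( void )
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;

    /* ... acknowledge the peripheral ... */

    vTaskNotifyGiveFromISR( xHandlerTask, &xHigherPriorityTaskWoken );
    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}

/* Created elsewhere with xHandlerTask as its handle. */
static void prvHandlerTask( void * pvParameters )
{
    ( void ) pvParameters;

    for( ;; )
    {
        /* Block until the ISR signals. This give/take round trip is what
         * grew from ~4 us to ~33 us under the SMP port. */
        ulTaskNotifyTake( pdTRUE, portMAX_DELAY );

        /* ... do the deferred processing ... */
    }
}
```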
The SMP port provided by the vendor also defines the portENTER_CRITICAL() and portEXIT_CRITICAL() macros to use vTaskEnterCritical()/vTaskExitCritical() instead of the simpler vPortEnterCritical()/vPortExitCritical() functions used in single-core mode. These also take almost two orders of magnitude longer than in the single-core case. Some of that time goes to the many calls to portGET_CORE_ID(), and some to waiting for the ISR/task spinlocks to be acquired (sometimes up to 1.5 μs).
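If anyone wants to collect comparable numbers, a critical-section round trip can be timed with the standard Cortex-M7 DWT cycle counter, something like the rough sketch below (it assumes nothing else has claimed the DWT, and must be called from task context):

```c
/* Rough A/B measurement of one critical-section round trip using the
 * standard ARMv7-M DWT cycle counter. */
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

#define DEMCR         ( *( volatile uint32_t * ) 0xE000EDFCu )
#define DWT_CTRL      ( *( volatile uint32_t * ) 0xE0001000u )
#define DWT_CYCCNT    ( *( volatile uint32_t * ) 0xE0001004u )

uint32_t ulTimeCriticalSectionCycles( void )
{
    uint32_t ulStart, ulEnd;

    DEMCR |= ( 1u << 24 );    /* TRCENA: enable the DWT/ITM blocks. */
    DWT_CTRL |= 1u;           /* CYCCNTENA: start the cycle counter. */

    ulStart = DWT_CYCCNT;
    portENTER_CRITICAL();
    portEXIT_CRITICAL();
    ulEnd = DWT_CYCCNT;

    /* At 320 MHz, divide by 320 to convert cycles to microseconds. */
    return ulEnd - ulStart;
}
```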
I’m trying to determine whether my fundamental understanding of SMP’s limitations is incomplete and there is a reason the performance is so constrained on this particular hardware, or whether something is wrong with the vendor-supplied SMP port (to which I had to make several modifications) and the performance impact can be mitigated to some extent.
Does anyone have rough (order of magnitude) timing data comparing a single-core vs. SMP port on their particular device?
Thanks in advance!
Edit: To clarify, I’m mostly curious about A/B tests of single-core vs. SMP with the same FreeRTOS configuration. I currently have things like configASSERT() and stack overflow checking enabled, but these apply to both ports equally.
The only exception is that the SMP port doesn’t support configUSE_PORT_OPTIMISED_TASK_SELECTION, so I had it disabled in my SMP tests but kept it enabled in single-core mode. I don’t think this has much of an impact in this particular case, since the task notification yields directly into the waiting task (the kernel doesn’t have to select which task to run).
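For completeness, the relevant FreeRTOSConfig.h deltas between the two builds look roughly like this (illustrative, not my complete configuration):

```c
/* Illustrative FreeRTOSConfig.h deltas between the two builds (not a
 * complete configuration). */
#define configCHECK_FOR_STACK_OVERFLOW    2    /* Enabled in both builds. */
#define configASSERT( x )    if( ( x ) == 0 ) { taskDISABLE_INTERRUPTS(); for( ;; ) {} }

#if ( configNUMBER_OF_CORES > 1 )
    /* SMP build: port-optimised task selection isn't supported. */
    #define configUSE_PORT_OPTIMISED_TASK_SELECTION    0
#else
    /* Single-core build. */
    #define configUSE_PORT_OPTIMISED_TASK_SELECTION    1
#endif
```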