SMP vs AMP scheduler execution in FreeRTOS

I am starting a new project with the STM32H747, which has two cores: a Cortex-M4 and a Cortex-M7. I am confused about the following aspects of the multicore implementation of FreeRTOS:

  1. When using SMP, there is only one copy of the kernel/scheduler. In this case, how is it decided which core the kernel will run on? Obviously, the scheduler will need to interrupt an ongoing thread on one of the cores, so is the core selected randomly or is there some other algorithm?
    Is the configTIMER_SERVICE_TASK_CORE_AFFINITY for the same purpose? I think something similar is there for Pi Pico “configTICK_CORE”.

  2. Can I use SMP for STM32H747 because even though the cores are different (M4 and M7), the instruction set architecture is the same ARMv7-M, so essentially, the same port can be used?

  3. In case SMP is not possible and I use AMP, can the usual semaphore, mutex, and queue APIs be used to achieve proper synchronization?

  1. I will admit that I haven’t studied the SMP ports in detail, but I would expect that part of the scheduler needs to run on every core. When the task on that core blocks, or unblocks another task, then that core needs to decide if it needs to change what task it will run, or if it needs to notify some other core that it needs to change what task it is running. It would be very inefficient for that core, if it wasn’t the “chosen scheduler” core, to pass off that decision to the other core and wait for an answer. Since SMP really does assume cores are symmetric (for the most part), there doesn’t need to be a “chosen core” for the scheduler. The Tick interrupt likely is pointed to one specific core, so that one will do more of the scheduling, but it is in no way limited to it.

1B) The Timer_Service task, which handles timer requests, is just like any other task, with no “special” powers (except that it handles a class of API calls), so you can give it an affinity if there is some reason you want timer callbacks to be done from a given core. (I sometimes wonder if it might not make sense for some systems to run multiple timer service tasks working off the same queue.)

  2. I don’t know that processor in particular, but I suspect that, like many similar processors, the M4 and the M7 have “private” tightly coupled memory that each processor is designed to run most of its code out of, and this makes trying to make it look “symmetric” difficult. You might be able to put the kernel code and the kernel objects (like semaphores and queues) in shared memory, and make sure anything in the “private” memory is only accessed by things running on that core.

  3. With AMP, most of the primitives do NOT work between cores. Stream Buffers / Message Buffers have special code that allows them to cross core boundaries, but as far as I know, the other primitives cannot be used that way.
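
To give a rough idea of the “special code” for Stream Buffers / Message Buffers: the kernel lets FreeRTOSConfig.h override the sbSEND_COMPLETED() macro, so that after writing to a buffer the sending core notifies the other core instead of trying to unblock a local task. The fragment below is only a sketch under that assumption. vNotifyOtherCore() is a hypothetical application function, not a FreeRTOS API, and how it raises the inter-core interrupt depends on the part.

```c
/* FreeRTOSConfig.h fragment for the sending core (sketch, AMP setup). */

/* Hypothetical application-provided function, NOT part of FreeRTOS:
 * raises an interrupt on the other core and tells it which buffer
 * was just written. */
void vNotifyOtherCore( void * pxUpdatedBuffer );

/* Override the default action taken after data has been written to a
 * stream/message buffer, so the other core is signalled instead of a
 * local task being unblocked. */
#define sbSEND_COMPLETED( pxStreamBuffer )    vNotifyOtherCore( pxStreamBuffer )
```

The interrupt handler on the receiving core would then call xMessageBufferSendCompletedFromISR() (or xStreamBufferSendCompletedFromISR()) on that buffer, so that its own copy of the kernel unblocks whichever task is waiting for the data.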

  1. As @richard-damon said, tick interrupt happens on one core and it interrupts other cores as needed. configTIMER_SERVICE_TASK_CORE_AFFINITY defines which core the timer task should run on (by default, it can run on any core i.e. no affinity).
  2. As @richard-damon said, if the memory is shared between cores, you should be able to.
  3. Here is an example of using Stream Buffers to enable cross core communication in AMP setup - FreeRTOS - multicore (dual core) inter core communication example on STM32H745I Discovery board from ST.
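
For reference, pinning the timer task in an SMP build could look roughly like the fragment below. This is a sketch, assuming the SMP-enabled kernel with its V11-style option names and a two-core target; the affinity value is a bitmask of the cores the task is allowed to run on, and leaving the option undefined keeps the default of no affinity.

```c
/* FreeRTOSConfig.h fragment (sketch, SMP kernel, two cores assumed). */
#define configNUMBER_OF_CORES                     2
#define configUSE_CORE_AFFINITY                   1
#define configUSE_TIMERS                          1

/* Bit 0 set: the timer service task may only run on core 0.
 * Leave this undefined to let the task run on any core. */
#define configTIMER_SERVICE_TASK_CORE_AFFINITY    ( 1U << 0 )
```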

Thanks, Richard and Gaurav,

Based on your response, this is what I understand:

  1. Tick runs on one fixed core (local).
  2. At every tick it is determined whether a task needs to be unblocked/blocked.
  3. If the task to be (un)blocked is on the remote core then an interrupt is generated for that core and the local core can resume normal work.
    Is this correct? Will this not cause overhead from the tick to mainly affect the local core?

Another question, when is the core for the next ready task decided in global non-fixed partitioned scheduling? Does that happen every tick and even a local core can schedule the next ready task on the remote core?

Not quite.

  1. is correct.
  2. is only partially correct. Yes, on every tick we evaluate whether some task has become ready, and perhaps do round-robin scheduling, but the scheduler is also activated from ANY core that is processing a FreeRTOS API call that either blocks the current task (to wait for something) or unblocks another task (by, perhaps, providing what it is waiting for).
  3. Tasks, in general, are not tied to a specific core (you MIGHT set affinity for some reason, but it is not required), so when the scheduler decides that some ready but not running task should be run now, it determines which running task should be stopped (unless it just blocked it) and, if the core that it will be scheduled on is different from the current core, signals that core with an interrupt.
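
The decision described in point 3 can be sketched as a small host-side toy model. This is not the kernel's code: core_state_t, pick_core_to_preempt() and needs_remote_interrupt() are hypothetical names made up for illustration, and affinity is ignored. The assumed rule is that the core running the lowest-priority task is asked to yield, and only if that priority is strictly lower than that of the newly ready task.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not FreeRTOS kernel symbols. */
#define NO_CORE (-1)

typedef struct {
    int running_priority;   /* priority of the task currently running on this core */
} core_state_t;

/* Return the index of the core that should be asked to yield so a newly
 * ready task of priority ready_priority can run, or NO_CORE if every core
 * is already running a task of equal or higher priority. */
int pick_core_to_preempt(const core_state_t *cores, size_t n_cores, int ready_priority)
{
    assert(cores != NULL && n_cores > 0);

    int victim = NO_CORE;
    int lowest = ready_priority;   /* only strictly lower priorities get preempted */

    for (size_t i = 0; i < n_cores; i++) {
        if (cores[i].running_priority < lowest) {
            lowest = cores[i].running_priority;
            victim = (int)i;
        }
    }
    return victim;
}

/* Returns 1 if the chosen core is not the core making the decision,
 * i.e. a cross-core interrupt would be needed; 0 otherwise. */
int needs_remote_interrupt(int victim, int current_core)
{
    return victim != NO_CORE && victim != current_core;
}
```

With core 0 running a priority-3 task and core 1 running a priority-1 task, a newly ready priority-2 task selects core 1, and if the decision was made on core 0 then core 1 has to be interrupted.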

Yes, whichever core is processing the tick interrupt will have a higher load due to it doing the tick processing, so if you have a high-priority, computationally heavy task you might want to give it an affinity to avoid the core processing the tick.

I would need to look at the details of the scheduler, but every core can request any other core to do a context switch and make its current task yield. I don’t know if it can actually specify what task that core will run, or if the other core will look at the task list itself to decide (that would be a simplicity vs efficiency trade).


The other core looks at the task list and decides.

I don’t have any experience with the STM32H747x but I do have the datasheet (DS12930 rev 2) and the Reference Manual (RM0399 rev 4), so of course I’m an expert and everyone should defer to my vast lack of knowledge (that is, don’t believe a word I tell you, or at least keep in mind I’m basing the following on my interpretation of things I see in the DS and RM, not from direct knowledge).

The short(ish) answer is that no, you cannot use SMP between the M7 and M4 cores (at least not in a way that takes advantage of the M7 or all of the available SRAM). Though they share the same instruction set, and can access the same physical peripherals, and some of the same SRAM, they do not share all of the SRAM you would need to be able to migrate running code between the cores without some major headaches.

If you look at Table 6 in the reference manual, you will see that the first 64 KiB of address space maps to different memory (Instruction TCM for the M7, remapped boot flash or memory (“VTOR Remap”) on the M4). With the right linker control files, this can be overcome to an extent. The memory itself will be different, but it can probably be arranged that the contents are the same. The first real problem I see is that the 128 KiB address range 0x1FF00000-0x1FF1FFFF (labeled “System Memory” in Table 6) is only available to the M7 core, and the range 0x20000000-0x2001FFFF is M7 data TCM, also not visible from the M4. Looking back at Table 2 in the reference manual, I see that although it is possible to access the 512 KiB SRAM on the AXI bus from the M4, it appears not to be “useful” to do so for either I or D bus access, only for S bus. The M7 has “useful” access to the SRAM connected more closely to the M4 complex, though the performance will be reduced because the AHB bus is 32 bits wide vs. the 64-bit wide AXI bus.
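
As a sanity check on the sizes quoted above, an inclusive address range [first, last] covers last - first + 1 bytes, so a 128 KiB range starting at 0x1FF00000 ends at 0x1FF1FFFF. A minimal host-side sketch (range_size_bytes() and KIB are just illustrative names):

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024u

/* Size in bytes of the inclusive address range [first, last]. */
uint32_t range_size_bytes(uint32_t first, uint32_t last)
{
    assert(last >= first);
    return last - first + 1u;
}
```

Calling range_size_bytes(0x1FF00000u, 0x1FF1FFFFu) and range_size_bytes(0x20000000u, 0x2001FFFFu) both give 128 * KIB, which matches the 128 KiB figure for the M7-only regions.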

I don’t really have the time to work out all of the details, but it seems to me that SMP would be quite a mistake. Though the cores share a large part of the address-space mapping, they should be treated as separate systems. With care, you can share data, but I would not recommend sharing a single copy of code between them outside of the flash. The software running on the two cores should be aware of each other, and hardware semaphores should be used where resource contention and other race conditions may be an issue.
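
To illustrate the pattern (not the actual peripheral), here is a host-side model of guarding shared data with a non-blocking take/release semaphore. It is a sketch with made-up names: a C11 atomic_flag stands in for one hardware semaphore, which is an assumption for illustration only, since on the real part the take and release would go through the hardware semaphore peripheral or the vendor HAL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Host-side stand-in for one hardware semaphore (illustration only). */
typedef struct {
    atomic_flag taken;
} hw_sem_model_t;

#define HW_SEM_MODEL_INIT { ATOMIC_FLAG_INIT }

/* Non-blocking take: returns true if the caller now owns the semaphore. */
bool hw_sem_try_take(hw_sem_model_t *sem)
{
    assert(sem != NULL);
    return !atomic_flag_test_and_set_explicit(&sem->taken, memory_order_acquire);
}

/* Release a semaphore previously taken by the caller. */
void hw_sem_release(hw_sem_model_t *sem)
{
    assert(sem != NULL);
    atomic_flag_clear_explicit(&sem->taken, memory_order_release);
}

/* Touch the shared data only while holding the semaphore.
 * Returns false, leaving the counter unchanged, if the semaphore is busy. */
bool shared_counter_increment(hw_sem_model_t *sem, volatile int *counter)
{
    if (!hw_sem_try_take(sem)) {
        return false;
    }
    (*counter)++;
    hw_sem_release(sem);
    return true;
}
```

Each core would attempt the take before touching the shared structure and retry or back off on failure. The point is that both images agree on which semaphore guards which piece of shared data.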

I am not sure how this is done, as I’ve not had access to these particular parts - I would pretty much build the images for the 2 cores as separate programs, and combine them in a final link or objcopy stage (not sure how the CubeIDE or other SDK tools handle it).
