FreeRTOS SMP on Cortex-R5F

Dirk · July 23, 2024, 2:13am

Hi all,

I am using the TI AM6442 and AM2432. The SDK for these two chips provides a FreeRTOS example for the R5F cores, but it deploys four separate FreeRTOS instances on the four R5F cores (AMP style), which does not meet our needs. What I want is to deploy a single SMP FreeRTOS across these four R5F cores, so I have the following questions:
1. Based on the hardware structure of these two chips, is it possible to deploy SMP FreeRTOS on the four R5F cores? (The structure of these two chips is as follows.)

2. If the answer to question 1 is yes, what work is involved and what process should I follow? (The AM6442 SDK provides SMP FreeRTOS on the A53 cores; can it be used as a reference?)

TheSemaphoreNoob · July 23, 2024, 9:22am

Hey there !!

The short answer to the first question is: no.
SMP stands for Symmetric MultiProcessing, and it is meant to run on symmetric hardware. Let’s isolate the part of your two chips that contains the R5F cores: as the diagram shows, TCM is used. Long story short, TCM is private memory attached to specific cores. Let’s number the cores from left to right, core 0 to core 3, and the TCMs the same way, TCM 0 and TCM 1. From what I understand of the diagram, TCM 0 is linked to cores 0 and 1, so cores 2 and 3 can’t access it. Thus there is an asymmetry in the memories. The FreeRTOS repository offers an SMP port for the AM64, but it targets the A53 cores.
Another limitation is cache coherence, which really is a pain. It’s briefly discussed on the TI forums.

Now… I have exactly the same problem with a dual-core MCU based on the Cortex-M7: no cache coherence mechanism (too expensive for Cortex-M) and TCM. My current mission is to make the SMP configuration work on it. At this point I’m pretty sure it’s possible: what I see in my debugger convinces me of it, I “just” have stuff to debug.
But there are limits and constraints: you can’t expect it to run without using the core affinity mechanism. Core affinity, together with cache maintenance operations and some good practices, is the only way to deal with the lack of cache coherence. It’s also a way to handle the private memory issue: if a task is always rescheduled on the same core, the data it uses in TCM will remain available.
Soooo… You can give it a try. Is it worth it in terms of dev time (I’ve been on it for about a month)? In terms of performance (lower use of TCM)? It’s up to you to decide whether it matches your constraints.
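
To make the “good practices” part a bit more concrete: cache maintenance works on whole cache lines, so a buffer handed from one core to another should start on a line boundary and cover whole lines, otherwise a clean or invalidate also touches neighbouring data. Below is a minimal sketch of that rounding, assuming a 32-byte line (check the TRM of your core). `CACHE_LINE_BYTES`, `cache_range_t` and `cache_range_for` are made-up names for illustration, not part of any SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; verify against the core's TRM. */
#define CACHE_LINE_BYTES 32u

/* Hypothetical helper type: a line-aligned address range. */
typedef struct {
    uintptr_t start;   /* rounded down to a line boundary */
    size_t    length;  /* multiple of CACHE_LINE_BYTES */
} cache_range_t;

/* Expand [addr, addr + len) so that it covers whole cache lines only. */
cache_range_t cache_range_for(uintptr_t addr, size_t len)
{
    assert(len > 0);

    const uintptr_t line_mask = ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
    uintptr_t start = addr & line_mask;
    uintptr_t end   = (addr + len + CACHE_LINE_BYTES - 1u) & line_mask;

    cache_range_t range = { start, (size_t)(end - start) };
    return range;
}
```

The resulting range is what you would hand to whatever clean or invalidate-by-address routine your SDK provides. Declaring shared buffers line-aligned and padded in the first place avoids the expansion spilling onto unrelated variables.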

If you wanna give it a try, don’t hesitate to check my profile: my previous topics were about choosing between SMP and AMP, or about difficulties during implementation.

Happy multiprocessing !

P.S.: Please keep in mind I’m an intern. My company assigned me to take the time to test this, with no guarantee of results. It’s not realistic for the company to let a senior dev (as you probably are) spend time on it.

richard-damon · July 23, 2024, 3:05pm

The answer really isn’t simple, and depends on what you really need. TCM will generally be “private” to the processor(s) it is attached to, thus using it restricts the “symmetric” part of SMP to the processors that share TCM. Thus pairs of R5Fs that share TCM may be able to use SMP assuming the TCM can actually be shared and isn’t just allocated part to one processor and part to another.

When you have a processor with this sort of private memory, you can make FreeRTOS work, but at a significant performance cost: TCM use has to be limited to tasks that are locked to the cores that can access it, and the FreeRTOS kernel itself needs to reside in, and use, the non-TCM memory, which will be slower.
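
As a rough sketch of “tasks locked to the cores that can access it”, assume the layout discussed earlier in the thread (cores 0 and 1 on TCM 0, cores 2 and 3 on TCM 1) and assume the TCM really is shared within a pair. The helper names and the topology constants are hypothetical. In the FreeRTOS SMP kernel, with `configUSE_CORE_AFFINITY` set to 1, a mask like this is what gets passed to `vTaskCoreAffinitySet()`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed topology, taken from the numbering used in this thread:
 * 4 R5F cores, 2 cores per TCM (cores 0-1 -> TCM 0, cores 2-3 -> TCM 1). */
#define NUM_CORES      4u
#define CORES_PER_TCM  2u

/* Bit mask with one bit set per core that can reach the given TCM. */
uint32_t tcm_affinity_mask(uint32_t tcm_index)
{
    assert(tcm_index < NUM_CORES / CORES_PER_TCM);

    uint32_t pair_bits = (1u << CORES_PER_TCM) - 1u;   /* 0b11 for a pair */
    return pair_bits << (tcm_index * CORES_PER_TCM);
}

/* Mask for a task that must always run on exactly one core. */
uint32_t single_core_mask(uint32_t core_id)
{
    assert(core_id < NUM_CORES);
    return 1u << core_id;
}

/* Returns 1 if a task with this mask may be scheduled on core_id, else 0. */
int mask_allows_core(uint32_t mask, uint32_t core_id)
{
    assert(core_id < NUM_CORES);
    return (int)((mask >> core_id) & 1u);
}
```

A task whose data lives in TCM 0 would get `tcm_affinity_mask(0)`, so the scheduler may still move it between cores 0 and 1 but never onto cores 2 or 3. If the TCM turns out to be split per core rather than shared, `single_core_mask()` is the stricter fallback.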

The key takeaway is that SMP really wants things to be symmetric, and pushing beyond that becomes workarounds with costs.

aggarg · July 25, 2024, 12:24pm

@richard-damon has very nicely answered your question. The only remaining question I have is about your use case - why do you want to use SMP?

Dirk · July 29, 2024, 2:16am

Hi @TheSemaphoreNoob,

I understand what you mean. Based on what you say, it is possible to deploy one FreeRTOS on the 2 CPU cores attached to each TCM, but impossible to deploy one FreeRTOS on all 4 CPU cores because of the structure. I also see that there is a 2MB SRAM, so my other question is: can this SRAM serve as a shared cache?

Dirk · July 29, 2024, 2:28am

Hi @richard-damon ,

I think what you are saying is that whether I can deploy SMP FreeRTOS on a CPU with dual R5F cores depends on its TCM structure, and that it is impossible to deploy SMP FreeRTOS on the 4 R5F cores because the two 128K TCMs are independent of each other. For each TCM, if it is shared by its 2 corresponding R5F cores, it is possible to deploy SMP FreeRTOS on those 2 cores. Is that right?

Dirk · July 29, 2024, 2:43am

Hi @aggarg ,

There are two considerations:
1. We want to tap into the computing power of this platform as much as possible, so an SMP structure may be a good way to achieve this.
2. We want to provide a platform where software engineers do not need to consider which CPU core a task should run on. They only need to develop the task software and set the task priorities, and task scheduling is left to SMP FreeRTOS.

TheSemaphoreNoob · July 29, 2024, 8:12am

Hiii

I’m not sure I understand correctly. (S)RAM and cache are not the same thing.
The S in SRAM stands for “Static”. It’s about the physical implementation: it uses transistors as latches to store bits, so it doesn’t require periodic refresh the way DRAM does (DRAM is what is typically referred to as just RAM). That makes SRAM faster than DRAM, but the trade-off is lower density (i.e. more silicon area for the same amount of memory) and a higher price. Apart from that, SRAM is RAM. If we see it on a block diagram, it means it is used and implemented as RAM.

Cache often uses SRAM technology, but that’s the only thing they have in common. Cache is implemented as close as possible to the cores. The goal is to shorten the “wires” between cores and cache: the shorter the path, the shorter the access time. Cache uses specific algorithms to optimize accesses, while RAM is more direct, so you get constant but slower access times. There are too many differences. You can probably “use it as” a cache, but you will certainly not get the same performance. This, imo, is the fundamental reason why we have different technologies, each with pros and cons, and have to choose among them depending on the application.

It is not exactly impossible… just not what it’s meant for. I’m actually doing it, using SMP on a dual core with TCM memories. But I have to make some concessions: I can’t use the full potential of the TCM.

I’m confused… You’re not thinking of running 2 SMP instances (one per pair of R cores) and making them communicate as two AMP nodes, are you? With an SMP configuration on the 2 A cores as well? That sounds like an incredibly tricky system! The question from @aggarg is without a doubt the most pertinent one.

From what I’ve experienced, the differences between SMP and AMP are more about how memory is used than about how MIPS are used.

I understand… But will it really be easier? If you want to use all the computing power, then I think you will end up trying what I described earlier… Is “an AMP of 3 SMP pairs”, or something like that, really simpler than “an AMP over the 6 cores”? To me, it will be a mess. It’s not such a mental load to think “this task must be on this core”. For inter-core communication there are examples in the datasheet that you can use. And it’s not necessary to have message buffers between every single pair of tasks: message buffers between cores are enough, with a task on each core reading these buffers and acting as a dispatcher.
It looks like a “larger structure”, but it is a clearer one imo. “The best is the enemy of the good”, Voltaire, La Bégueule
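
The dispatcher layout described above can be sketched in plain C as one single-producer, single-consumer mailbox per core pair, drained by a dispatcher on the receiving core. Everything here is a hypothetical illustration: the names, the slot count and the per-task `inbox` array (standing in for a FreeRTOS queue or message buffer per task) are assumptions. On a part without cache coherence the mailbox would also have to live in non-cacheable shared memory or be bracketed by cache maintenance, which this sketch does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAILBOX_SLOTS    8u   /* must be a power of two */
#define NUM_LOCAL_TASKS  4u   /* hypothetical task count on the receiving core */

typedef struct {
    uint32_t dest_task;       /* index of the destination task on the receiver */
    uint32_t payload;
} core_msg_t;

typedef struct {
    core_msg_t  slots[MAILBOX_SLOTS];
    atomic_uint head;         /* advanced only by the sending core */
    atomic_uint tail;         /* advanced only by the receiving core */
} core_mailbox_t;

/* Sending core: returns 1 on success, 0 if the mailbox is full. */
int mailbox_send(core_mailbox_t *mb, core_msg_t msg)
{
    unsigned head = atomic_load_explicit(&mb->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&mb->tail, memory_order_acquire);
    if (head - tail == MAILBOX_SLOTS) {
        return 0;
    }
    mb->slots[head % MAILBOX_SLOTS] = msg;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&mb->head, head + 1u, memory_order_release);
    return 1;
}

/* Receiving core: returns 1 and fills *out, or 0 if the mailbox is empty. */
int mailbox_receive(core_mailbox_t *mb, core_msg_t *out)
{
    unsigned tail = atomic_load_explicit(&mb->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&mb->head, memory_order_acquire);
    if (head == tail) {
        return 0;
    }
    *out = mb->slots[tail % MAILBOX_SLOTS];
    /* Free the slot only after its contents are copied out. */
    atomic_store_explicit(&mb->tail, tail + 1u, memory_order_release);
    return 1;
}

/* Dispatcher: drain the mailbox and route each message to its local task.
 * Returns the number of messages delivered. */
unsigned dispatch_pending(core_mailbox_t *mb, uint32_t inbox[NUM_LOCAL_TASKS])
{
    unsigned delivered = 0;
    core_msg_t msg;
    while (mailbox_receive(mb, &msg)) {
        assert(msg.dest_task < NUM_LOCAL_TASKS);
        inbox[msg.dest_task] = msg.payload;
        delivered++;
    }
    return delivered;
}
```

In a real system the dispatcher task would more likely block on an inter-core interrupt than poll, and would forward each message into the destination task’s own queue. The point of the layout is that the number of mailboxes grows with the number of cores, not with the number of tasks.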

richard-damon · July 29, 2024, 11:05am

The problem is that the hardware just doesn’t support that level of abstraction. Tightly Coupled Memory doesn’t fit with the concept of not caring which processor runs a task. A processor designed for that sort of operation would replace the TCM with a large cache, probably with 2 levels, and perhaps a larger global memory.