Multicore advice and suggestions

ldb wrote on Wednesday, April 10, 2019:

I have ported freeRTOS successfully onto all 4 cores of the CortexA53.

To do it what I basically did is put an “assignedCore” field into the TCB struct so any task when running is assigned to a core.

I packed a lot of the static global data into a core control block (coreCB) so things like pxCurrentTCB are in there and it is statically assigned to core number so it looks like this

#define MAX_CPU_CORES 4
static struct CoreControlBlock {
	volatile TCB_t* pxCurrentTCB; /*< Current task running on core */
	/* blah blah blah */
} coreCB [MAX_CPU_CORES] = { 0 };

A task can, for example, know its core’s current by

Now for the gotcha: there are functions that take NULL as a handle, specifically via this macro

#define prvGetTCBFromHandle( pxHandle ) ( ( ( pxHandle ) == NULL ) ? pxCurrentTCB : ( pxHandle ) )

So it appears you can write vTaskDelete(NULL), which I assume means delete the current task, but I have 4 current tasks :slight_smile:

Those ones are relatively easy: I can just stop you ever using NULL by making it illegal. But this is the one that has me scratching my head:
Does uxTaskPriorityGetFromISR(NULL) really mean get the current task priority from some random interrupt, and when is it ever used?

It is a showstopper because the interrupt is probably being handled by a different core, so there is no possible translation of that which I can think of.

richard_damon wrote on Wednesday, April 10, 2019:

The NULL task pointer shouldn’t be a problem, as you should be able to know which core a given thread of execution is executing on, so ‘current task’ would be the current task of THAT core. So vTaskDelete(NULL) makes a lot of sense, and can be implemented. You can even use that inside an ISR: it would pertain to the task that the ISR interrupted on the core that the ISR was attached to. Now, there may not be a lot of reason for the ISR to care, because ‘current task’ doesn’t mean as much when other cores are running other tasks.
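A minimal sketch of that idea, not taken from anyone's port: the NULL handle resolves to the current task of whichever core is executing the call. The cut-down `TCB_t`, the `simulatedCoreID` variable and the body of `getCoreID()` are assumptions made so the example builds on a host; on a real Cortex-A53 `getCoreID()` would read the core affinity field from the MPIDR_EL1 register instead.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPU_CORES 4

/* Cut-down stand-in for the real TCB (assumption for this sketch). */
typedef struct TCB {
    unsigned assignedCore;
    unsigned uxPriority;
} TCB_t;

/* One control block per core, indexed by core number. */
static struct CoreControlBlock {
    TCB_t *volatile pxCurrentTCB; /* task currently running on this core */
} coreCB[MAX_CPU_CORES];

/* Hypothetical host-side stand-in for reading the hardware core ID. */
static unsigned simulatedCoreID = 0;

static unsigned getCoreID(void)
{
    return simulatedCoreID;
}

/* Per-core variant of prvGetTCBFromHandle: NULL means
 * "the task currently running on the calling core". */
static TCB_t *prvGetTCBFromHandle(TCB_t *pxHandle)
{
    unsigned core = getCoreID();
    assert(core < MAX_CPU_CORES);
    return (pxHandle == NULL) ? coreCB[core].pxCurrentTCB : pxHandle;
}
```

With this shape, a call such as vTaskDelete(NULL) stays unambiguous as long as the core ID is read on the core that makes the call.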

rtel wrote on Wednesday, April 10, 2019:

It sounds like you have created your own SMP version of FreeRTOS
(symmetric multiprocessing, where there is only one instance of FreeRTOS
that is scheduling tasks across multiple cores). Is that the case?
There are a few of these around (like the version Espressif have for
their dual core ESP32) so it’s an interesting topic.

ldb wrote on Wednesday, April 10, 2019:

@Richard Damon Ah yes … getCoreID, which is just an assembler read of the Core ID register, will do that on those other pesky calls.

The IRQ one won’t work because the IRQs are just flagged and immediately cleared. A separate task will note the IRQ flag and go handle them, so the task and the IRQ could end up on separate cores.

All the serious interrupts are on the FIQ system so FreeRTOS doesn’t interfere with them, which is sort of like the M4 question earlier.

ldb wrote on Wednesday, April 10, 2019:

@Richard Barry no, there are 4 versions of FreeRTOS running, there are even 4 idle tasks, there is no master core and no core ever sleeps. At the front end the tasks pull from a long master task queue as well as the local queue, which is the only change from normal FreeRTOS.

So there are 5 queues: 4 local normal FreeRTOS ones and one larger system one. The heavy timing-sensitive tasks are assigned to specific cores and the cores pick over the lesser tasks from the system queue as they see fit.

richard_damon wrote on Wednesday, April 10, 2019:

If you have 4 distinct versions of FreeRTOS running, with 4 separate task lists, then NULL is not a problem, as each FreeRTOS version has its own ‘current task’ to refer to. From your description though, it sounds like you DO have a ‘single’ version of FreeRTOS, with some information being kept specific for each core, and some information global and shared with all cores.

If the IRQ just kicks a task, then the ISR doesn’t need to know about ‘current task’, and the task handling it of course knows who the ‘current task’ is: it is itself (when it is running).

ldb wrote on Wednesday, April 10, 2019:

Calling it a single system is sort of weird. They are completely normal FreeRTOS; it is just that the idle task pulls low priority tasks from the long queue rather than spinning around in a dead loop. Obviously the long queue has locks to stop all the cores hitting it at once, and that is the only thing shared between the cores.

I can also completely shut off one FreeRTOS system and the other 3 FreeRTOS systems will merrily chug along completely oblivious; I just lose the specific tasks that were assigned manually to that 1 core.

I am not sure I would call it one FreeRTOS system, I would tend to call it 4 parallel FreeRTOS systems with a shared low priority task queue.
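As a rough illustration of the shared low priority queue described above, here is a minimal sketch of a lock-protected ring buffer that each core's idle loop could pull from. All names (`sharedQueuePush`, `sharedQueuePop`, `SHARED_QUEUE_LEN`) are hypothetical, tasks are represented by plain integer IDs, and the lock is a C11 `atomic_flag` spinlock rather than whatever the actual port uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SHARED_QUEUE_LEN 8 /* hypothetical capacity */

/* The only state shared between cores in this sketch. */
static atomic_flag sharedQueueLock = ATOMIC_FLAG_INIT;
static int    sharedItems[SHARED_QUEUE_LEN];
static size_t sharedHead  = 0;
static size_t sharedCount = 0;

/* Spin until this core owns the queue. */
static void sharedQueueAcquire(void)
{
    while (atomic_flag_test_and_set_explicit(&sharedQueueLock,
                                             memory_order_acquire)) {
        /* busy-wait: another core holds the lock */
    }
}

static void sharedQueueRelease(void)
{
    atomic_flag_clear_explicit(&sharedQueueLock, memory_order_release);
}

/* Add a low priority task ID; returns false if the queue is full. */
static bool sharedQueuePush(int taskId)
{
    bool ok = false;
    sharedQueueAcquire();
    if (sharedCount < SHARED_QUEUE_LEN) {
        sharedItems[(sharedHead + sharedCount) % SHARED_QUEUE_LEN] = taskId;
        sharedCount++;
        ok = true;
    }
    sharedQueueRelease();
    return ok;
}

/* Called from a core's idle loop; returns false if nothing is waiting. */
static bool sharedQueuePop(int *taskIdOut)
{
    bool ok = false;
    assert(taskIdOut != NULL);
    sharedQueueAcquire();
    if (sharedCount > 0) {
        *taskIdOut = sharedItems[sharedHead];
        sharedHead = (sharedHead + 1) % SHARED_QUEUE_LEN;
        sharedCount--;
        ok = true;
    }
    sharedQueueRelease();
    return ok;
}
```

Because every access goes through the one lock, shutting down a core simply removes one consumer; the other cores keep popping as before.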

ldb wrote on Wednesday, April 10, 2019:

One other quick one: is uxSchedulerSuspended actually a bool or is it a count?

It is set to and compared to pdFALSE but it is incremented and decremented like it is a count, so I am confused.

richard_damon wrote on Wednesday, April 10, 2019:

uxSchedulerSuspended is a count, as the scheduler suspend/resume calls can be nested, and the resume doesn’t actually resume until the count gets back to zero.
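The nesting behaviour can be sketched as follows. The function names `sketchSuspendAll`, `sketchResumeAll` and `schedulerIsSuspended` are hypothetical simplifications for illustration, not the FreeRTOS API; only the counter name comes from the discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Nesting depth of scheduler suspension; 0 means the scheduler runs. */
static volatile unsigned uxSchedulerSuspended = 0;

/* Each suspend call adds one level of nesting. */
static void sketchSuspendAll(void)
{
    ++uxSchedulerSuspended;
}

/* Each resume call removes one level; returns true only when the
 * outermost level has been released and scheduling really resumes. */
static bool sketchResumeAll(void)
{
    assert(uxSchedulerSuspended > 0); /* unbalanced resume is a bug */
    --uxSchedulerSuspended;
    return uxSchedulerSuspended == 0;
}

static bool schedulerIsSuspended(void)
{
    return uxSchedulerSuspended != 0;
}
```

Comparing the counter against zero (or pdFALSE, which is zero) is therefore a test for "not suspended at any depth", which is why it can look like a bool in the source.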

The key difference between the one-version and four-version views is this: is there one set of machine language instructions in memory (where a given core/thread finds the right set of core-specific globals), or are there multiple independent instruction streams, each compiled with its own private set of globals that the other copies can’t reference, with limited communication boxes between them?

The first is clearly an SMP (or very close to it) system, with all the power and issues of SMP; the second is getting much more like AMP.

Part of the key difference is how much you need to worry about sharing between cores. One key advantage of AMP is that sharing tends to be fairly explicit, so can be simpler, and cross core communication tends to be explicit. With SMP, what sharing is cross core and what sharing is intra core is often less well defined, and thus you need to take more care about things.

ldb wrote on Thursday, April 11, 2019:

Thought that might be the issue. I had mapped it in a single bitfield

struct {
    volatile unsigned uxPercentLoadCPU : 16;
    volatile unsigned uxIdleTickCount : 16;	
    volatile unsigned uxCPULoadCount: 16;
    volatile unsigned xSchedulerRunning : 1;
    volatile unsigned xYieldPending : 1;
    volatile unsigned uxSchedulerSuspended : 1; /* WOOOPSSSSIE */

I have changed it to 16 bits and changed the compares from pdFALSE to 0 so I don’t fall for that trap again :slight_smile:
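A small host-side sketch of why the 1-bit field breaks nesting: an unsigned bit-field wraps modulo its width, so two nested increments of a 1-bit field land back on 0. The struct and function names here are hypothetical and exist only for this demonstration.

```c
#include <assert.h>

/* The trap: a nesting counter squeezed into a single bit. */
struct FlagsOneBit {
    unsigned uxSchedulerSuspended : 1;
};

/* The fix described above: 16 bits leaves plenty of nesting depth. */
struct FlagsSixteenBit {
    unsigned uxSchedulerSuspended : 16;
};

/* Two nested "suspends" on the 1-bit field: 0 -> 1 -> wraps to 0. */
static unsigned nestTwiceOneBit(void)
{
    struct FlagsOneBit f = { 0 };
    f.uxSchedulerSuspended++;
    f.uxSchedulerSuspended++;
    return f.uxSchedulerSuspended;
}

/* The same two increments on the 16-bit field: 0 -> 1 -> 2. */
static unsigned nestTwiceSixteenBit(void)
{
    struct FlagsSixteenBit f = { 0 };
    f.uxSchedulerSuspended++;
    f.uxSchedulerSuspended++;
    return f.uxSchedulerSuspended;
}

/* Usage: the 1-bit version looks "not suspended" after two suspends. */
static void bitfieldDemo(void)
{
    assert(nestTwiceOneBit() == 0);
    assert(nestTwiceSixteenBit() == 2);
}
```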

Now if you really want to try and class the system in those terms I would say it is BMP because there are tasks bound to each core. However for me it fails the definitions of SMP, AMP and BMP because there is absolutely no IPC … not even a simple one.

It gets worse if you talk about address space because the code can run 1 copy of freertos.obj for all 4 cores or it can run 1 copy of freertos.obj per core. You make that decision based solely on the L1/L2 cache connections to the core. A Cortex-A53 has all 4 cores sharing the L2 cache, so you run one freertos.obj and the 4 cores run through it but use their own coreCB data block (it looks exactly like a C++ object). On a Cortex-A75 each core has its own L1/L2 cache, so you would have 4 copies of freertos.obj and each core would run through its own block. The code block doesn’t care which way you do it as each core has its own data block.

So on address space it would be classed SMP & AMP depending how I set it up :slight_smile:

I think those sorts of definitions just muddy the water, all in an effort to give it a name … it is what it is.

richarddamon wrote on Thursday, April 11, 2019:

Perhaps one thing that muddies the picture here is that I am not sure you are keeping some of the assumptions that at least I see as part of the fundamental model in using FreeRTOS. One key aspect is address space. FreeRTOS has an implied very simple address mapping: Logical Address Space in a task = Physical Memory Address Space (i.e. there is no MMU remapping of addresses). Any task can give any other task (or ISR) an address, and it can access that location without any overhead of re-mapping the address to its address space. There are some restricted tasks (that are sort of second class citizens) that might not be able to do some accesses to that location, but its address is always the same for every task that wants to use it, and no special function needs to be called to get such a chunk of memory.

One thing this implies is that things like Thread Local Storage (or Core Local Storage) aren’t done by address remapping. Any routine that wants to access TLS can’t just assume some fixed address whose memory mapping will change, but needs to access a pointer in the Task Control Block. This makes TLS access a bit more awkward, but means that task switching can be fast and efficient. One implication of this is that if multiple tasks share the same base code, they share ‘globals’, and if they need private copies of data, they need to refer to the TCB to get the address of their private data, or keep a pointer on the stack.

Part of this assumption is rooted in the fact that FreeRTOS runs on many processors, many of which don’t have an MMU, so it can’t assume the ability to remap addresses to provide TLS at a common address. The optional restricted tasks only need an MPU, not an MMU, so it would be possible to create an MPU port for a processor without a real MPU that works just the same, except that the restricted tasks aren’t really restricted.

If you keep that rule for a multicore version, then all the cores are running off of the same copy of code and all use the same address space, so to get to the core-specific data structures, they need to get their core-id and select the right block for it. This means it doesn’t matter if there is a shared cache or distinct caches for the cores, as address X is address X. Perhaps it makes sense that a multi-core variant would assume an MMU and use it to provide some limited Core Local Storage to simplify some operations. That starts to depart from some of the core principles behind FreeRTOS, but some of that happens anyway in a multicore system.

I think here the distinction between SMP and AMP (with BMP somewhat in the middle) is that the AMP model puts a FreeRTOS ‘system’ on a single core, and the primary interactions are between tasks on that core, using resources that are basically dedicated to that core. There may be stuff going on in other cores, but that is ‘Them’ and not ‘Us’, and communication to them is done differently, perhaps even through a different API, or at the least the code behaves somewhat differently when doing so. That is the Asymmetric part.

In SMP, the whole system works together and you don’t really care what core a given task might be on; at the task level you just talk to it through the basic services and things happen. There isn’t an Us vs Them on cores and it is all just We. Tasks might be locked to specific cores for efficiency or to promote scheduling, but the basics of inter-task communication don’t assume that. This is ‘Symmetric’.

In between is Bound Multiprocessing, where tasks don’t move core to core, and maybe you bind a group of them together on a given core to have more efficient inter-task communication. The OS still deals with things multi-core, with perhaps a bit of a mix of AMP and SMP techniques within it.

ldb wrote on Thursday, April 11, 2019:

I completely agree with everything you said, and much of it is behind decisions that are somewhat forced on me.

On the Pi there is a minimum of a 20-fold penalty for not running with the MMU. It catches people out, but it means you are forced to at least a 1:1 MMU mapping just to run FreeRTOS as is. So your memory mapping problems as well as ideas are very much in play.

The second set of decisions was that it is pointless to support 8-bit stuff at all because you don’t get multicore 8-bit CPUs and almost all have to do aligned 16, 32 and 64 bit loads and stores. So some of the types for this are preordained to a minimum 16-bit size for anything. The variable above (uxSchedulerSuspended) is typical of one of those, where 16 bits easily gives you enough nesting depth but it is pointless going smaller because the load/store will actually be slower if you made it a byte, for example. If you want to save memory space you bitfield pack (which I did above) because these processors can usually load and roll in a single instruction, so it costs you nothing in speed.

How or if I could ever merge this back into FreeRTOS itself is a question for down the track once I see its final form. It’s all running and I am writing it up to put on GitHub, so I will shoot a link up when it’s up, which should be by the weekend. My old single core port is still up but I am about to revamp it with a 1:1 MMU mapping to get the speed back to where it should be.