Shareable MPU regions for DMA on Cortex-M33

Intro

We are currently evaluating FreeRTOS v11.2.0 for use on a Cortex-M33 with the MPU enabled (STM32 H563 or H573 if relevant).
We are currently not using TrustZone, but this may change in the future.
To offload processing burdens, we plan to use DMA for many peripherals, e.g. UART communication.

Documentation

Regarding this topic, I looked at the documentation of xTaskCreateRestricted, which I think is incorrect:
It refers to xMemoryRegions.ulParameters using portXXX macros, but the code I saw indicates they should use the tskXXX macros defined in task.h.

Did I understand correctly that the documentation is incorrect in this regard?
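
For reference, this is roughly how I would expect a region entry to look when the tskXXX macros are used. It is a minimal sketch, not taken from the documentation: the #ifndef fallbacks only exist so the snippet compiles without the FreeRTOS headers, their values are my assumption of what task.h defines, and DemoMemoryRegion_t, ucDmaBuffer and xDemoDmaRegion are hypothetical names.

```c
#include <stdint.h>

/* Fallbacks so the sketch compiles standalone. In a real project these come
 * from task.h; the values below are assumptions, check your copy. */
#ifndef tskMPU_REGION_READ_WRITE
    #define tskMPU_REGION_READ_ONLY        ( 1U << 0U )
    #define tskMPU_REGION_READ_WRITE       ( 1U << 1U )
    #define tskMPU_REGION_EXECUTE_NEVER    ( 1U << 2U )
    #define tskMPU_REGION_NORMAL_MEMORY    ( 1U << 3U )
    #define tskMPU_REGION_DEVICE_MEMORY    ( 1U << 4U )
#endif

/* Stand-in with the same three fields as MemoryRegion_t (hypothetical name). */
typedef struct
{
    void * pvBaseAddress;
    uint32_t ulLengthInBytes;
    uint32_t ulParameters;
} DemoMemoryRegion_t;

/* Hypothetical DMA buffer. Armv8-M MPU regions have 32-byte granularity,
 * so the buffer is aligned and sized accordingly. */
static uint8_t ucDmaBuffer[ 256 ] __attribute__( ( aligned( 32 ) ) );

/* One entry as it would appear in TaskParameters_t.xRegions:
 * unprivileged read/write, no execution, normal (cacheable) memory. */
const DemoMemoryRegion_t xDemoDmaRegion =
{
    ucDmaBuffer,
    sizeof( ucDmaBuffer ),
    tskMPU_REGION_READ_WRITE | tskMPU_REGION_EXECUTE_NEVER | tskMPU_REGION_NORMAL_MEMORY
};
```

With the portXXX macros from the documentation, the same ulParameters value would carry different bits, which is why I am asking which set is the intended one.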

Sharing with DMA

To me, it looks like the shareable setting for configurable MPU regions is hardcoded as non-shareable:

// portable/GCC/ARM_CM33_NTZ/non_secure/port.c l. 1953 ff
                xMPUSettings->xRegionsSettings[ ulRegionNumber ].ulRBAR = ( ulRegionStartAddress ) |
                                                                          ( portMPU_REGION_NON_SHAREABLE );

However, to allow correct usage of other bus masters like the DMA, the corresponding setting must be portMPU_REGION_OUTER_SHAREABLE.

Did I understand correctly that using tskMPU_REGION_NORMAL_MEMORY currently will result in a setting that is incompatible with the usage of DMA (or another bus master) accessing this memory region?

Assuming I am right, I am surprised that I didn’t find more about this. I would have assumed that using DMA to write to an MPU-protected memory region is a common use case.

The non-secure side of the Cortex-M33 only supports 8 regions, out of which 5 are already used by FreeRTOS itself (to protect the kernel and the task stacks), which leaves only 3 regions.
Assuming that a task which uses DMA-enabled hardware also needs access to a region with peripheral registers, this already uses up 2 regions (peripherals + DMA buffers), leaving only one region for “everything else”. This makes it likely that the region that includes the DMA buffers also includes other data with different access patterns.

I see the following options to solve the DMA problem:

Option (1): Device memory
We could use tskMPU_REGION_DEVICE_MEMORY instead of tskMPU_REGION_NORMAL_MEMORY, since the shareable attributes are ignored for device memory.
This will reduce the performance of accesses to this region, which matters both for large objects like communication buffers and for objects that are accessed very often.

Option (2): Always shared
Locally patch port.c to use portMPU_REGION_OUTER_SHAREABLE by default.
This would make all regions shareable, which will decrease performance for other tasks, but should be better than (1) for the tasks accessing the buffers.

I have no idea which of options (1) and (2) is better performance-wise, and I also don’t know how relevant the performance difference really is.

Option (3): Additional flag
Add another setting like tskREGION_SHARED_MEMORY and set the shareability according to this flag.
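
A rough sketch of what I mean by option (3), written as a standalone function rather than as a patch to port.c. All demoXXX names and the task-level flag values are hypothetical. The RBAR bit positions (XN in bit 0, AP in bits 2:1, SH in bits 4:3 with 0b00 = non-shareable and 0b10 = outer shareable) are what I understand from the Armv8-M MPU register description. Without the new flag, the result stays non-shareable as it is today.

```c
#include <stdint.h>

/* Hypothetical task-level flags (stand-ins for the tskMPU_REGION_xxx idea). */
#define demoREGION_READ_ONLY           ( 1U << 0 )
#define demoREGION_READ_WRITE          ( 1U << 1 )
#define demoREGION_EXECUTE_NEVER       ( 1U << 2 )
#define demoREGION_OUTER_SHAREABLE     ( 1U << 5 ) /* the proposed new flag */

/* Armv8-M MPU_RBAR fields, as assumed in the text above. */
#define demoRBAR_XN                    ( 1U << 0 )
#define demoRBAR_AP_READ_WRITE         ( 1U << 1 ) /* AP = 0b01: RW, any privilege */
#define demoRBAR_AP_READ_ONLY          ( 3U << 1 ) /* AP = 0b11: RO, any privilege */
#define demoRBAR_SH_NON_SHAREABLE      ( 0U << 3 )
#define demoRBAR_SH_OUTER_SHAREABLE    ( 2U << 3 )
#define demoRBAR_ADDRESS_MASK          ( 0xFFFFFFE0U )

/* Build an RBAR value from a region start address and task-level flags. */
uint32_t ulDemoBuildRBAR( uint32_t ulStartAddress, uint32_t ulParameters )
{
    uint32_t ulRBAR = ulStartAddress & demoRBAR_ADDRESS_MASK;

    /* Shareability is opt-in; non-shareable remains the default. */
    if( ( ulParameters & demoREGION_OUTER_SHAREABLE ) != 0U )
    {
        ulRBAR |= demoRBAR_SH_OUTER_SHAREABLE;
    }
    else
    {
        ulRBAR |= demoRBAR_SH_NON_SHAREABLE;
    }

    /* Access permissions: read-only takes precedence over read-write.
     * With neither flag set, AP stays 0b00 (privileged read-write). */
    if( ( ulParameters & demoREGION_READ_ONLY ) != 0U )
    {
        ulRBAR |= demoRBAR_AP_READ_ONLY;
    }
    else if( ( ulParameters & demoREGION_READ_WRITE ) != 0U )
    {
        ulRBAR |= demoRBAR_AP_READ_WRITE;
    }

    if( ( ulParameters & demoREGION_EXECUTE_NEVER ) != 0U )
    {
        ulRBAR |= demoRBAR_XN;
    }

    return ulRBAR;
}
```

A DMA buffer region would then pass demoREGION_READ_WRITE | demoREGION_EXECUTE_NEVER | demoREGION_OUTER_SHAREABLE, while existing callers that never set the new flag would get the same RBAR value as before.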

What approach would you recommend? Or is there another solution to this problem I haven’t thought about?

Cache and task switching

There is another potential problem which I encountered on a Cortex-M7 processor using SafeRTOS, which might also be relevant in this case, and I would like to know whether it is realistic that this problem occurs here as well.

Assume the following structure:
Region R contains both a DMA buffer B and some other data D.
The DMA writes to B.
ISR I writes to D. (I is not necessarily related to the DMA, but could also be e.g. a DMA transfer complete interrupt which copies some DMA status registers to D)
Task T1 reads B and D and has access to region R configured.
Task T2 does not use any of this data, and therefore does not have R configured.

When I fires while T2 executes, the MPU does not contain a setting for R, so I accesses D without an active MPU region.
If PRIVDEFENA=1, this would generate a fault.
If PRIVDEFENA=0, which is the default, I accesses D using the “default memory map”, which may differ in cache settings from the setting in T1/R.
This difference in cache settings may prevent invalidation of the cache used by T1.
After a task switch, T1 may access outdated data from the cache.

On the other processor, we enabled the DMA buffer region for all tasks, so even while executing T2, there will still be an active cache setting for R (maybe with access set to privileged-only).
However, the low number of configurable regions for the M33 would be a problem for this approach.

Unfortunately, I was so far unable to find documentation of the default cache settings for my processor, so I do not know which option (if any) would provide compatible caching and shareability settings.

Would this problem also occur on the FreeRTOS Cortex-M33 port or does FreeRTOS implement some other measures (e.g. explicit cache invalidation on context switch)?

Yes, you are right. For M33 port, you should use tskXXX macros.

You are right that the region will be configured with portMPU_REGION_NON_SHAREABLE.

The RBAR register contains the following attributes:

  1. Shareability attributes.
  2. Access permissions.
  3. Execute never.

Out of the above 3, “Access permissions” and “Execute never” are already configurable. So I’d suggest using your option 3 to make “Shareability attributes” configurable as well. We can default to portMPU_REGION_NON_SHAREABLE to ensure backward compatibility. Would you be willing to raise a PR for this?

The architecture allows 8 or 16 MPU regions. Did you check how many MPU regions this part contains?

This would be the same on all M33 ports. I am not sure about a solution to this problem, but is it possible to explicitly invalidate the cache in the ISR I?
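
One detail if you go that way: cache maintenance by address works on whole cache lines, so the range has to be widened to line boundaries first. Below is a minimal sketch of just that arithmetic. The 32-byte line size and all demoXXX names are assumptions, and the actual invalidate operation is device specific, so it is only indicated in a comment.

```c
#include <stddef.h>
#include <stdint.h>

#define demoCACHE_LINE_SIZE    ( 32U ) /* assumption; check the device manual */

typedef struct
{
    uintptr_t uxStart; /* address of the first affected cache line */
    size_t uxLength;   /* length in bytes, a multiple of the line size */
} DemoCacheRange_t;

/* Widen [pvAddress, pvAddress + uxBytes) to whole cache lines. */
DemoCacheRange_t xDemoAlignToCacheLines( const void * pvAddress, size_t uxBytes )
{
    DemoCacheRange_t xRange;
    const uintptr_t uxMask = ( uintptr_t ) demoCACHE_LINE_SIZE - 1U;
    const uintptr_t uxFirst = ( uintptr_t ) pvAddress & ~uxMask;
    const uintptr_t uxEnd = ( ( uintptr_t ) pvAddress + uxBytes + uxMask ) & ~uxMask;

    xRange.uxStart = uxFirst;
    xRange.uxLength = ( size_t ) ( uxEnd - uxFirst );

    /* In the ISR, the device-specific invalidate-by-address operation
     * would be issued for xRange before the data is handed to the task. */
    return xRange;
}
```

Because the widened range can cover neighbouring data, it is safer to keep DMA buffers aligned to, and sized in, whole cache lines so that nothing unrelated shares a line with them.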

Is this only for the M33 port? I would assume that the API should be the same for all ports. I looked e.g. at ARM_CM3_MPU and it does it the other way round (sets tskXXX in the internal settings based on portXXX given as parameter), which looks very strange to me.

// in vPortStoreTaskMPUSettings
// portable/GCC/ARM_CM3_MPU/port.c
                if( ( ( xRegions[ lIndex ].ulParameters & portMPU_REGION_READ_ONLY ) == portMPU_REGION_READ_ONLY ) ||
                    ( ( xRegions[ lIndex ].ulParameters & portMPU_REGION_PRIVILEGED_READ_WRITE_UNPRIV_READ_ONLY ) == portMPU_REGION_PRIVILEGED_READ_WRITE_UNPRIV_READ_ONLY ) )
                {
                    xMPUSettings->xRegionSettings[ ul ].ulRegionPermissions = tskMPU_READ_PERMISSION;
                }

// portable/GCC/ARM_CM33_NTZ/non_secure/port.c
                if( ( xRegions[ lIndex ].ulParameters & tskMPU_REGION_READ_ONLY ) != 0 )
                {
                    xMPUSettings->xRegionsSettings[ ulRegionNumber ].ulRBAR |= ( portMPU_REGION_READ_ONLY );
                }

Unfortunately, I will not be able to contribute a change to make it configurable for now. We will change tskMPU_REGION_NORMAL_MEMORY to refer to portMPU_REGION_OUTER_SHAREABLE.
Even if it is less performant, it is a less error-prone setting for users who do not look at such details, so I think this would be a better default value.
If you agree, I could open an issue/PR for that. Would you classify this as a bug or an enhancement?
I’ve looked at the other ports, and none of them seem to use portMPU_REGION_OUTER_SHAREABLE, so I’m confused that this hasn’t come up yet at some other point.

For this device, the non-secure side only supports 8 regions, unfortunately.

I think our application will actually not be affected by this problem, since the regions which are likely candidates for this problem will not have D-cache.

Just for the sake of the argument/future reference: I think it would be a better idea to invalidate all D-caches during a task switch. Otherwise, it would be hard not to miss any combination of regions, tasks and ISRs where this could happen.
It would probably be more “correct” to only invalidate those which are relevant for affected regions, but doing that separately may degrade performance more than a full invalidation.
Unfortunately, it looks like cache maintenance operations are implementation specific, so this cannot be done reliably by the kernel.

Maybe there should be a note about that in the MPU manual.

This is for all Armv8-M ports. Ideally it should be the same for all ports, but currently it is not. The xxx_PERMISSION macros are for the access control feature, which is a different thing.

I’d prefer not to break backward compatibility. Please open an issue on the FreeRTOS repository for making “Shareability attributes” configurable.

If you want to invalidate cache on every context switch, you can consider overriding traceTASK_SWITCHED_IN or traceTASK_SWITCHED_OUT.
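
A minimal sketch of how that could look. traceTASK_SWITCHED_IN is the kernel's existing trace hook; vDemoInvalidateDataCache and ulDemoInvalidateCount are hypothetical names, and the function body is only a stub because the real invalidate operation depends on the device.

```c
#include <stdint.h>

/* Hypothetical counter, only here so the stub has an observable effect. */
static volatile uint32_t ulDemoInvalidateCount = 0U;

/* Stub: a real implementation would issue the device-specific
 * "invalidate data cache" operation here. */
void vDemoInvalidateDataCache( void )
{
    ulDemoInvalidateCount++;
}

/* Would normally be defined in FreeRTOSConfig.h. The kernel expands this
 * macro in the context switch path once the next task has been selected. */
#define traceTASK_SWITCHED_IN()    vDemoInvalidateDataCache()
```

The hook runs inside the context switch, so it should be kept short, and it adds its cost to every switch regardless of which tasks are involved.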

I opened issue 1384 to add this fact to the API documentation.
Diving deeper, I think I now understand the need for the tsk* macros: Is it correct that they were introduced because the CM33 uses multiple registers, and the port* macros have clashing bit definitions?
So using the tsk* macros would be the “proper” way to hide port-specific implementation details like the port* macro definitions, and this abstraction was just never implemented for the other ports?

→ Issue 1383
For now, I will probably go forward with a local patch to allow development to go on.
Thanks for your support so far.

That’s a good idea. Thank you very much!

Yes, that is exactly the reason!

Thank you!