FreeRTOS STM32F4-7 TASK structs in DTCM

glenenglish wrote on Monday, April 24, 2017:

The subject title says it all.
Has anyone put the most frequently used FreeRTOS data structs into the DTCM memory of the STM32F4/7?

This would mean the processor doesn’t have to pull the structures through the cache (possibly evicting other cached data) during a context switch. I already put interrupt handlers in ITCM; it might also be useful to force FreeRTOS into ITCM for when I have lots of tasks switching fast. Depends on the use case, of course.

FreeRTOS has very fast and cheap context switches. That’s the huge advantage over, say, Linux when task switch overhead is compared (not that I’m trying to compare Linux to FreeRTOS here; they serve different use cases).
-glen

rtel wrote on Monday, April 24, 2017:

Never tried. If you statically allocate the structures then it should
be easy enough to put them in DTCM and experiment. How big is the DTCM?

glenenglish wrote on Tuesday, April 25, 2017:

depends

STM32F429: "core coupled memory" (CCM) for data, 64kB. The core doesn’t have to access it via AHB0 like SRAM, which is good when dealing with heavy DMA users.
(STM32F4: no data cache. It has an instruction cache of sorts to deal with flash speed.)

STM32F7: instruction cache and data cache, each varying from 4 to 16kB.
TCM has a direct processor connection, like CCM.
DTCM varies from 64kB to 128kB,

which is plenty for my work!

STM32H7: 64kB ITCM, 128kB DTCM. The DTCM is actually 2 x 64kB and can be used for dual issue (dual instruction, operand fetch) if you are on your game.

With the advent of cache, microcontroller programmers need to learn about that stuff and think carefully about how they write their variable accesses.
I learned about it in 2007 on Blackfin and found I had to completely rewrite my algorithms to avoid thrashing. Then I learned to love cache and embrace it.

I try to pack all my frequently accessed globals into one cache line’s worth, etc.

The use of DTCM reduces the programmer’s need to be so careful about cache misses.
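
As a rough illustration of the "one cache line’s worth" idea, here is a minimal C11 sketch. It assumes a 32-byte data cache line (the Cortex-M7 line size), and the struct and field names are made up for the example rather than taken from any real project.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Assumption: 32-byte data cache line, as on the Cortex-M7. */
#define CACHE_LINE_BYTES 32

/* Hypothetical "hot" globals, grouped so that touching any one of them
 * brings all of them into the cache with a single line fill. */
typedef struct {
    alignas(CACHE_LINE_BYTES) uint32_t sample_count; /* aligns the whole struct */
    uint32_t read_index;
    uint32_t write_index;
    uint32_t error_flags;
    int16_t  gain_q15;
    int16_t  offset_q15;
    uint8_t  state;
    uint8_t  channel;
} hot_globals_t;

/* Fail the build if the group ever grows past one cache line. */
static_assert(sizeof(hot_globals_t) == CACHE_LINE_BYTES,
              "hot globals no longer fit in one cache line");

hot_globals_t hot;
```

The static_assert is the useful part: it turns "I try to keep these together" into something the compiler checks every build.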

glenenglish wrote on Tuesday, April 25, 2017:

Hmm, I probably want TWO different types of pvPortMalloc():

one for DTCM and one for everything else…

This is probably much more important for the F7, which has a data cache, than for the F4.

Like in xTaskCreate(), the stack can go in DTCM so stack access doesn’t have to fight DMA masters.

I guess I could use xTaskCreateStatic() and supply it memory pointers.

DMA-able memory must go in normal SRAM.

Hi

I know this post is old, but it came up in my search on a similar topic, so I thought I would post this as it might be useful to someone.

I put static FreeRTOS objects in DTCM routinely.

I use a #if switch to place them in DTCM or AXI RAM, so if I am using a processor with little or no DTCM I can easily switch. When using the STM32H7 family, employing the DTCM is a bit of a no-brainer.

An example is:

#define USE_DTCM_RAM_FOR_WIFI_TASK 1

/* RTOS Objects */

#define STACK_SIZE_START_WIFI_TASK 512
#define STACK_SIZE_WIFI_TASK 512

#if(USE_DTCM_RAM_FOR_WIFI_TASK == 1)
StackType_t       __attribute__((section (".DTCM_MISC"))) startWiFiTaskStack[STACK_SIZE_START_WIFI_TASK];
StaticTask_t      __attribute__((section (".DTCM_MISC"))) startWiFiTskBuffer = {NULL};

StackType_t       __attribute__((section (".DTCM_MISC"))) wiFiTaskStack[STACK_SIZE_WIFI_TASK];
StaticTask_t      __attribute__((section (".DTCM_MISC"))) wiFiTskBuffer = {NULL};

TaskHandle_t      __attribute__((section (".DTCM_MISC"))) startWiFiTskHandle = {NULL};
TaskHandle_t      __attribute__((section (".DTCM_MISC"))) wiFiTskHandle = {NULL};

SemaphoreHandle_t __attribute__((section (".DTCM_MISC"))) wiFiReadySemHandle = {NULL};
StaticSemaphore_t __attribute__((section (".DTCM_MISC"))) wiFiReadySemBuff = {NULL};

volatile bool     __attribute__((section (".DTCM_MISC"))) validWiFiRouterConn = false;
#else
StackType_t       startWiFiTaskStack[STACK_SIZE_START_WIFI_TASK];
StaticTask_t      startWiFiTskBuffer = {NULL};

StackType_t       wiFiTaskStack[STACK_SIZE_WIFI_TASK];
StaticTask_t      wiFiTskBuffer = {NULL};

TaskHandle_t      startWiFiTskHandle = {NULL};
TaskHandle_t      wiFiTskHandle = {NULL};

SemaphoreHandle_t wiFiReadySemHandle = {NULL};
StaticSemaphore_t wiFiReadySemBuff;

volatile bool     validWiFiRouterConn = false;
#endif

As you can see I also pop other variables in: validWiFiRouterConn

The linker script:

/* Memories definition */
MEMORY
{
  FIR_FILT_DW   (xrw)    : ORIGIN = 0x20000000,   LENGTH =     4K
  DTCM_MISC     (xrw)    : ORIGIN = 0x20001000,   LENGTH =   120K

  /* Must start where DTCM_MISC ends: 0x20001000 + 120K = 0x2001F000 */
  DTC_STACK     (xrw)    : ORIGIN = 0x2001F000,   LENGTH =     4K


...
  .DTCM_MISC : {
    . = ALIGN(4);
    _sDTCM_MISC = .;
    __sDTCM_MISC__ = _sDTCM_MISC;
    *(.DTCM_MISC)
    . = ALIGN(4);
    _eDTCM_MISC = .;
    __eDTCM_MISC__ = _eDTCM_MISC;
  } >DTCM_MISC

You have to be careful with initialisation of the DTCM: make sure you set the variables to zero explicitly, or add a zero fill to your startup code. I didn’t do this and ended up with a hard fault. Silly me. The startup additions look like this:

/* Start and end of miscellaneous DTCM */
.word  _sDTCM_MISC
.word  _eDTCM_MISC

 /* Do next segment */
  ldr  r2, = _sDTCM_MISC
  b  LoopFill_DTCM

/* Zero fill the DTCM segment. */
Fill_DTCM:
  movs  r3, #0
  str  r3, [r2], #4

LoopFill_DTCM:
  ldr  r3, = _eDTCM_MISC
  cmp  r2, r3
  bcc  Fill_DTCM

  /* Do next segment */
  ldr  r2, = _sbss
  b  LoopFillZerobss

/* Zero fill the bss segment. */
FillZerobss:
  movs  r3, #0
  str  r3, [r2], #4
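
For readers who prefer C, the same zero fill can be sketched as a small function called from the startup path before main(). This is only a sketch: zero_fill_words is a made-up helper name, and the commented-out call assumes the _sDTCM_MISC and _eDTCM_MISC symbols defined in the .DTCM_MISC output section.

```c
#include <stdint.h>

/* Hypothetical helper: store zero to every 32-bit word in [start, end).
 * Mirrors the Fill_DTCM / LoopFill_DTCM loop: compare first, then store. */
void zero_fill_words(volatile uint32_t *start, volatile uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}

/* Assumed usage on the target (not compiled here):
 *
 *   extern uint32_t _sDTCM_MISC, _eDTCM_MISC;   // linker script symbols
 *   zero_fill_words(&_sDTCM_MISC, &_eDTCM_MISC);
 */
```

Because the function touches no statics of its own, it should be safe to call before .bss and .data are set up, but that depends on your toolchain and startup file.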

I hope someone finds this useful.

I’m not sure that it speeds things up all that much. I haven’t tested it, but it seems a waste not to use the DTCM, as it doesn’t load up the data cache and it frees up some of the AXI RAM. The STM32H7B3I has 128 kB of DTCM and it costs nothing to use it.

What’s not to like?

Regards
Rob

An alternative to repeating the declarations would be to #define a symbol to be either __attribute__((section(".DTCM_MISC"))) or just blank, and use that on each of the items.
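
A minimal sketch of that idea, assuming GCC-style attributes. DTCM_MISC_ATTR and USE_DTCM_RAM are illustrative names, and uint32_t stands in for StackType_t so the fragment builds without the FreeRTOS headers; in a real project the StackType_t, StaticTask_t and handle declarations would carry the same macro.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build-time switch (hypothetical name). Defaults to off so the file also
 * builds for parts whose linker script has no .DTCM_MISC section. */
#ifndef USE_DTCM_RAM
#define USE_DTCM_RAM 0
#endif

#if (USE_DTCM_RAM == 1)
#define DTCM_MISC_ATTR __attribute__((section(".DTCM_MISC")))
#else
#define DTCM_MISC_ATTR /* expands to nothing: default placement */
#endif

#define STACK_SIZE_WIFI_TASK 512

/* Each object is declared once; the macro decides where it lives. */
uint32_t      DTCM_MISC_ATTR wiFiTaskStack[STACK_SIZE_WIFI_TASK];
volatile bool DTCM_MISC_ATTR validWiFiRouterConn = false;
```

With this shape there is only one copy of each declaration to keep in sync, and the switch can come from the build system (for example -DUSE_DTCM_RAM=1) instead of an edit to the source.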

Not sure if putting the task control block in DTCM will help that much (but if it’s available, why not), but putting the stack there should be a big help for most programs.

Yes, that would be a better way. I’ll do that in future. I did put the stacks in.

There is also pvPortMallocStack(), which lets you allocate stacks from somewhere other than the heap so they can be placed in fast memory: see tasks.c in the FreeRTOS/FreeRTOS-Kernel repository on GitHub.
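
As a sketch of how that hook might be used: when configSTACK_ALLOCATION_FROM_SEPARATE_HEAP is set to 1 in FreeRTOSConfig.h, the application supplies pvPortMallocStack() and vPortFreeStack(), and the kernel calls them for dynamically created task stacks. The bump allocator below is an illustration under several assumptions: the pool name and size are made up, the section attribute is left as a comment so the fragment builds anywhere, freed stacks are never reused, and a real version would need protection (for example suspending the scheduler) if tasks can be created concurrently.

```c
#include <stddef.h>
#include <stdint.h>

#define DTCM_STACK_POOL_BYTES 8192u  /* hypothetical pool size */
#define STACK_ALIGN_BYTES     8      /* 8-byte stack alignment on Cortex-M */

/* On the target, also add __attribute__((section(".DTCM_MISC"))) here. */
static uint8_t dtcm_stack_pool[DTCM_STACK_POOL_BYTES]
    __attribute__((aligned(STACK_ALIGN_BYTES)));
static size_t dtcm_stack_pool_used = 0;

/* Hand out the next 8-byte-aligned chunk, or NULL when the pool is full. */
void *pvPortMallocStack(size_t xSize)
{
    size_t rounded;
    void *block = NULL;

    if (xSize > DTCM_STACK_POOL_BYTES) {
        return NULL; /* also keeps the rounding below from overflowing */
    }
    rounded = (xSize + (STACK_ALIGN_BYTES - 1)) & ~(size_t)(STACK_ALIGN_BYTES - 1);

    if (rounded <= (DTCM_STACK_POOL_BYTES - dtcm_stack_pool_used)) {
        block = &dtcm_stack_pool[dtcm_stack_pool_used];
        dtcm_stack_pool_used += rounded;
    }
    return block;
}

/* A bump allocator cannot free; stacks of deleted tasks stay allocated. */
void vPortFreeStack(void *pv)
{
    (void)pv;
}
```

This only makes sense for systems that create their tasks once at start-up; if tasks are created and deleted at run time, a proper free list would be needed instead.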