STM32H7: mixing manually and automatically placed variables in memory sections

Hi,

I have a few different memory regions on an STM32H7. As a result, heap_5.c is used to manage the heap.

That said, I have some variables that need to be placed manually in a specific region.
When I do that, I get a conflict between the automatic placement and the manual placement.

The only option I can think of is to hand that entire region over to manual placement.
But if manual placement is avoidable, that looks much better.

Is that possible?

For example, my scatter file looks like this:

LR_IROM1 0x08000000 0x00200000  {    ; load region size_region
  ER_IROM1 0x08000000 0x00200000  {  ; load address = execution address
   *.o (RESET, +First)
   *(InRoot$$Sections)
   .ANY (+RO)
  }
  RW_IRAM2 0x24000000 0x00080000  {  ; RW data
   .ANY (+RW +ZI)
  }
  RW_DMARxDscrTab 0x30040000 0x60 {  ; Ethernet Rx DMA descriptors
   *(.RxDecripSection)
  }
  RW_DMATxDscrTab 0x30040060 0x140 { ; Ethernet Tx DMA descriptors
   *(.TxDecripSection)
  }
  RW_Rx_Buffb 0x30040200 0x1800 {    ; Ethernet Rx buffers
   *(.RxArraySection)
  }
}

My heap initialization looks like this:

#define __ram(__addr)			__attribute__((at(__addr)))

/**
 * 0x38800000 - 0x38800FFF (   4096) Backup SRAM
 * 0x38000000 - 0x3800FFFF (  65536) SRAM4
 * 0x30040000 - 0x30047FFF (  32768) SRAM3
 * 0x30020000 - 0x3003FFFF ( 262144) SRAM2
 * 0x30000000 - 0x3001FFFF ( 131072) SRAM1
 * 0x24000000 - 0x2407FFFF ( 524288) AXI SRAM
 * 0x20000000 - 0x2001FFFF ( 131072) DTCM
 * 0x1FF00000 - 0x1FF1FFFF ( 131072) System Memory
 * 0x08100000 - 0x081FFFFF (1048576) Flash memory bank 2
 * 0x08000000 - 0x080FFFFF (1048576) Flash memory bank 1
 * 0x00000000 - 0x0000FFFF (  65536) ITCM
 */
#define MEMSIZ(__start, __end)		(((__end) - (__start)) + 1)
	
#define SRAM_AXI			(uint32_t) 0x24000000
#define SRAM_AXI_END			(uint32_t) 0x2407ffff
#define SRAM_AXI_SIZ			MEMSIZ(SRAM_AXI, SRAM_AXI_END)
	
#define SRAM_1				(uint32_t) 0x30000000
#define SRAM_1_END			(uint32_t) 0x3001ffff
#define SRAM_1_SIZ			MEMSIZ(SRAM_1, SRAM_1_END)

#define SRAM_2				(uint32_t) 0x30020000
#define SRAM_2_END			(uint32_t) 0x3003ffff
#define SRAM_2_SIZ			MEMSIZ(SRAM_2, SRAM_2_END)

#define SRAM_3				(uint32_t) 0x30040000
#define SRAM_3_END			(uint32_t) 0x30047fff
#define SRAM_3_SIZ			MEMSIZ(SRAM_3, SRAM_3_END)

#define SRAM_4				(uint32_t) 0x38000000
#define SRAM_4_END			(uint32_t) 0x3800ffff
#define SRAM_4_SIZ			MEMSIZ(SRAM_4, SRAM_4_END)

#define SRAM_RTC			(uint32_t) 0x38800000
#define SRAM_RTC_END			(uint32_t) 0x38800fff
#define SRAM_RTC_SIZ			MEMSIZ(SRAM_RTC, SRAM_RTC_END)

static void init_heap(void)
{
	static uint8_t heap_1[SRAM_AXI_SIZ] __ram(SRAM_AXI);	/* SRAM_AXI */
	static uint8_t heap_2[SRAM_1_SIZ]   __ram(SRAM_1);	/* SRAM_1 */
	static uint8_t heap_3[SRAM_2_SIZ]   __ram(SRAM_2);	/* SRAM_2 */
	static uint8_t heap_4[SRAM_3_SIZ]   __ram(SRAM_3);	/* SRAM_3 */
	static uint8_t heap_5[SRAM_4_SIZ]   __ram(SRAM_4);	/* SRAM_4 */

	HeapRegion_t xHeap[] = {
		{(uint8_t *) heap_1, sizeof (heap_1)},		/* SRAM_AXI */
		{(uint8_t *) heap_2, sizeof (heap_2)},		/* SRAM_1 */
		{(uint8_t *) heap_3, sizeof (heap_3)},		/* SRAM_2 */
		{(uint8_t *) heap_4, sizeof (heap_4)},		/* SRAM_3 */
		{(uint8_t *) heap_5, sizeof (heap_5)},		/* SRAM_4 */
		{NULL, 0}					/* Terminate Array */
	};
	vPortDefineHeapRegions(xHeap);
}

and the manual placement looks like this:

__attribute__((section(".RxDecripSection"))) ETH_DMADescTypeDef  DMARxDscrTab[ETH_RX_DESC_CNT]; /* Ethernet Rx DMA Descriptors */
__attribute__((section(".TxDecripSection"))) ETH_DMADescTypeDef  DMATxDscrTab[ETH_TX_DESC_CNT]; /* Ethernet Tx DMA Descriptors */
__attribute__((section(".RxArraySection"))) uint8_t Rx_Buff[ETH_RX_DESC_CNT][ETH_RX_BUFFER_SIZE]; /* Ethernet Receive Buffer */

How can I handle this mix of manual and automatic placement in a graceful manner?
(If possible, I would like to do away with the manual placement itself. Is that the right thing to do?)

Any thoughts/suggestions ?

Thanks,

Manu

Just looking at SRAM3, I see you are giving the whole of that SRAM to the heap:

static uint8_t heap_4[SRAM_3_SIZ]   __ram(SRAM_3);	/* SRAM_3 */

while simultaneously trying to place the DMARxDscrTab array at the same address:

__attribute__((section(".RxDecripSection"))) ETH_DMADescTypeDef  DMARxDscrTab[ETH_RX_DESC_CNT]; /* Ethernet Rx DMA Descriptors */

I would expect a linker error from this. Assuming there is a hardware restriction that means you must have the descriptors at that address, then don’t allocate that memory to the heap.
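
For example, something along these lines (just a sketch, not tested, and the names are mine; the 0x30041A00 start and 0x6600 size simply follow from the descriptor/buffer placements in your scatter file, so adjust them to whatever you actually reserve):

/* Hand only the unreserved tail of SRAM3 to heap_5. The first 0x1A00 bytes
 * (0x30040000 - 0x300419FF) stay outside the heap because the scatter file
 * places the Ethernet descriptors and Rx buffers there. */
#define SRAM_3_HEAP		((uint8_t *) 0x30041A00)	/* first byte after Rx_Buff */
#define SRAM_3_HEAP_SIZ		((size_t) 0x6600)		/* up to 0x30047FFF */

static const HeapRegion_t xHeapRegions[] = {
	/* ... lower-address regions first; heap_5 requires ascending order ... */
	{ SRAM_3_HEAP, SRAM_3_HEAP_SIZ },	/* unreserved part of SRAM3 only */
	/* ... higher-address regions ... */
	{ NULL, 0 }				/* terminates the array */
};
/* then pass xHeapRegions to vPortDefineHeapRegions() as before */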

Hi Richard,

Thanks for the reply.

Yes, there is a linker error. Two, in fact.
STM32H743ZI-Nucleo\STM32H743ZI-Nucleo.axf: Error: L6971E: app_ethernet.o(.data) type RW incompatible with main.o(.ARM.__AT_0x24000000) type ZI in er RW_IRAM2.

STM32H743ZI-Nucleo\STM32H743ZI-Nucleo.axf: Error: L6985E: Unable to automatically place AT section main.o(.ARM.__AT_0x30040000) with required base address 0x30040000. Please manually place in the scatter file using the --no_autoat option. 

But the error itself is not the issue at hand. I have been trying to use as little memory as possible for the stack and give the rest to the heap. Anyway, that is not what I was really getting at.

It is true that those variables are tied to a specific region due to a hardware restriction. Indeed, yes.

The Ethernet peripheral is in a domain that can be connected directly to the SRAM1, SRAM2 or SRAM3 (32-bit) regions. Its buffers may be put in other regions, but that comes with a penalty. For some, the penalty might not be an issue, but in the larger picture it is: mapping the peripheral to other regions interferes with the operation of other peripherals.
For example, if I were to use the framebuffer region for the Ethernet buffers, where should the framebuffer eventually be located …? Not only that, there will be contention over the bridges between the various domains. (That would be a bad design decision; it makes more sense to map the Ethernet buffers/descriptors to SRAM1/2/3 only and not elsewhere.)

My question centers on this issue, but that is not quite what I meant to ask in my original post.

As per the logic outlined above, my intention is to keep the Ethernet buffers and descriptors in SRAM1 (128k), SRAM2 (256k) or SRAM3 (32k), so that the memory stays within the same domain. If I manually place the buffers in SRAM1, all 128k of it would have to be managed manually; the same goes for SRAM2 or SRAM3. By design choice, the vendor recommends using SRAM3, but 32k can be slightly too small in certain situations, e.g. during a TLS handshake, unless the handshake itself is fragmented (which comes with a performance penalty). Another thought would be to put the buffers into the AXI SRAM (64-bit), which is the brain-dead option: it not only eats into the framebuffer region but also causes unnecessary bus traffic. The graphics primitives alone cause significant bus transactions, so the SDMMC, Ethernet, LCD and the CPU would all be fighting for access to the AXI bus (if you have DRAM on the PCB, add that to the mix as well). That is an ugly design choice.

My post arises from the thought of whether an annotation tying a variable to a memory region can in some way be used for manual placement of those variables while still using automatic placement for everything else.

For example, in FreeRTOSv202011.00\FreeRTOS\Demo\CORTEX_M0+_LPC51U68_GCC_IAR_KEIL\app:

	/* Place the first block of the heap memory in the first bank of RAM. */
	static uint8_t ucHeap1[ configTOTAL_HEAP_SIZE ];

	/* Place the second block of the heap memory in the second bank of RAM. */
	static uint8_t ucHeap2[ 16 * 1024 ] COMPILER_ATTRIBUTE_PLACE_IN_2ND_MEMORY_BANK;

I see that the annotation results in specific placement, via the linker, I presume.

Keeping that in mind, I wondered whether the allocation can be crafted in such a way that the variables are placed similarly. (I am trying to avoid the conflict between automatic and manual placement. Even if that is addressed, there is still the other part: how to tell malloc which memory bank to allocate from.)
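
For reference, the usual shape of such an annotation seems to be a per-toolchain macro that drops the array into a named section, which the linker script or scatter file then maps onto the chosen RAM bank. A rough sketch of what I mean (the macro and section names here are mine, not the demo's actual definitions, and the section still has to be selected somewhere in the scatter file):

/* Hypothetical placement macro, modelled on the demo's
 * COMPILER_ATTRIBUTE_PLACE_IN_2ND_MEMORY_BANK. The ".heap_bank2" section name
 * is made up; the scatter file / linker script must map it, e.g. with a
 * *(.heap_bank2) selector in the execution region covering that RAM bank. */
#if defined( __ICCARM__ )			/* IAR */
	#define PLACE_IN_2ND_MEMORY_BANK	@ ".heap_bank2"
#else						/* GCC / Arm Compiler */
	#define PLACE_IN_2ND_MEMORY_BANK	__attribute__( ( section( ".heap_bank2" ) ) )
#endif

static uint8_t ucHeap2[ 16 * 1024 ] PLACE_IN_2ND_MEMORY_BANK;

As far as I can tell, unlike at() this does not pin an absolute address in the object file, so it should not collide with the automatic placement; the address comes from whichever execution region selects the section.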

If that is not possible, the only other option that really makes sense is to make the entire SRAM3 available for the Ethernet buffers/descriptors. (But I would like to see that only as the low-hanging fruit.)

Thanks,
Manu

Manu, that is a lot of text, and much of it has to do with the STM32H memories and their properties, not with FreeRTOS.

Did you make sure that you added the memory regions in sorted order? Each region must start at a higher address than the previous one.

In my test project for the Ethernet driver, I ended up using part of the AXI SRAM at address 0x24000000 for the Ethernet. That worked perfectly, but I must admit that I did not use any other peripherals at the same time.

Please find the latest STM32H driver here.
It uses section( ".ethernet_data" ) for all Ethernet DMA buffers and DMA descriptors.
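
In scatter-file terms that amounts to something like this (the region name and size are only an example; the point is that the section is selected explicitly rather than pinned with at()):

  RW_ETH 0x30040000 0x8000 {	; SRAM3, for instance
   *(.ethernet_data)		; all Ethernet DMA descriptors and buffers
  }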

Can you describe what you expect from the Ethernet? Will you use many TCP and/or UDP sockets? Do you need high-volume TCP connections? All of that determines the number of DMA and network buffers. The size of the TCP sliding windows also influences the RAM usage (non-DMA, though).

Hi Hein,

Greetings!

When you drive a car, you need to know which country you are driving in, the details of the car, and the rules of the road of the land you are driving in. I am not saying something bad is bound to happen otherwise, but in my view it results in rather poor performance.

Similarly, the OS is restricted to, and runs on, the designated hardware, so some knowledge of the hardware is necessary to make a judgement. I felt a clearer and more concise picture was needed, hence the details. Everyone makes wrong decisions. For example, if the lwIP project had chosen a multi-threaded model, probably nobody would have bothered to try anything else; different people found it difficult to apply those decisions to their own requirements, and so people moved on … I gave that example because it should make things more evident, since you were also bitten by those issues.
When we talk about the kernel, it is mostly hardware and memory, isn't it?

In short: what you have done is put the D2 domain inside the D1 domain. By mixing both domains you get neither better performance nor better security. Your Ethernet peripheral gets better performance, but at the cost of the others.

Citing what the reference manual (RM) says:

2.1.1 Bus matrices
AXI bus matrix in D1 domain
The D1 domain multi AXI bus matrix ensures and arbitrates concurrent accesses from multiple masters to multiple slaves. This allows efficient simultaneous operation of high-speed peripherals.
The arbitration uses a round-robin algorithm with QoS capability.
Refer to Section 2.2: AXI interconnect matrix (AXIM) for more information on AXI interconnect.

AHB bus matrices in D2 and D3 domains
The AHB bus matrices in D2 and D3 domains ensure and arbitrate concurrent accesses from multiple masters to multiple slaves. This allows efficient simultaneous operation of high-speed peripherals.
The arbitration uses a round-robin algorithm.

What was the whole point of having different domains, then?
(A human being has legs, hands and so on; a balance needs to exist between them! You cannot simply state that you like your right hand better because you type code with it. :slight_smile: )

I have the Ethernet driver from:

FreeRTOSv202011.00\FreeRTOS-Plus\Source\FreeRTOS-Plus-TCP\portable\NetworkInterface\STM32Hxx

which I guess is almost the same.

From my view, there are two use cases for Ethernet on the H7 MCU.

  1. A traditional MCU with Ethernet doing low-bandwidth operations, similar to most small IoT devices using TLS. (Latency would be the issue in this case, not bandwidth.) As time passes TLS becomes mandatory, as various countries enact laws regarding the security of such devices.
    These devices do not need high-volume connectivity. (ST itself treats the Ethernet peripheral as a low-performance peripheral, hence it sits in the D2 domain; otherwise it would have been in the D1 domain.) (If peripherals in the D1 domain were stalled, that would have a visible impact on display activity and/or file access.) Even with high-bandwidth activity, where would you store the data? Even if you use external DRAM, the supported DRAM size is quite small.

  2. An MCU with an external-DRAM application that can deal with large chunks of data. (The high-speed Ethernet application really only applies in this scenario.) This is a very small segment, as anyone in this situation will go for a larger CPU, more memory and so on. Cost-wise, a bigger CPU would also be better, given fewer DRAM/CPU compatibility issues. Few dwellers in this application scenario.

My current use case is use case 1, as of now. It deals with a smaller number of buffers, hence I was thinking about the SRAM3 “low-hanging fruit” option. The application also requires a display and so on. RAM is one thing one would like more of in all situations, hence I am trying to optimize its usage. You need quite a bit of memory for drawing too, much more than for the Ethernet stack.

Thanks,
Manu

I think I understand your problem, and I think the simple answer is that these three regions, while fixed in hardware, are not required to be defined in the linker control file exactly like this. You could split one of them, such as SRAM3, into two blocks, SRAM3A and SRAM3B, and have SRAM3A hold all the things that need fixed addresses, while SRAM3B holds the stuff that can just be auto-assigned, along the lines of the sketch below.
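
For example (a sketch only; the 0x30041A00 split address just follows from the sizes already in the scatter file, and the new region name is made up):

  RW_DMARxDscrTab 0x30040000 0x60 {   ; "SRAM3A": Rx DMA descriptors, unchanged
   *(.RxDecripSection)
  }
  RW_DMATxDscrTab 0x30040060 0x140 {  ; "SRAM3A": Tx DMA descriptors, unchanged
   *(.TxDecripSection)
  }
  RW_Rx_Buffb 0x30040200 0x1800 {     ; "SRAM3A": Rx buffers, unchanged
   *(.RxArraySection)
  }
  RW_SRAM3B 0x30041A00 0x6600 {       ; rest of SRAM3, free for auto-placement
   .ANY (+RW +ZI)                     ; auto-assigned RW/ZI data
  }

The block you hand to vPortDefineHeapRegions() for SRAM3 would then have to match SRAM3B rather than the full 32k, or be left out of the heap altogether; equally, a heap array could be steered into SRAM3B with a named section selector instead of .ANY.
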

Nearly, yes. The vendor does exactly that. SRAM3's 32k is more than enough if one uses plain communication, but with TLS, for example, 32k is not enough.
And there is a penalty for placing that region anywhere other than SRAM1/2/3.

I initially wondered whether it was possible to do away with the “split/partition”, as it has been termed, use automatic allocation, and “somehow” get region-specific memory (something à la GFP_DMA). (But that needs the allocator to also be aware of the region specifics.)
After this discussion, I guess that is too far-fetched.

Settling on either SRAM1 or SRAM3, “split/partitioned/sectioned” for the Ethernet stuff, would be the only sane option, AFAICT. SRAM2, being larger and contiguous, could be useful as it is.

Thanks,
Manu