Use MPU to detect stack overflow on Arm Cortex M4 and M7?

Our software runs on a range of ARM Cortex M0+, M4F and M7 processors. A perennial issue we have is not knowing how much stack each task needs. We have stack overflow checking enabled in FreeRTOS but frequently it doesn’t detect a stack overflow before it causes a crash.

We use FreeRTOS through our own C++ wrapper that places each task stack immediately after the corresponding Task object. This means that when the stack overflows, it usually corrupts a Task object. That way, we at least know which task had the stack overflow.

Most of the processors we use have the MPU option; so I am wondering whether we could use the MPU to protect the Task object, and hence guard against the stack overflowing into it. On the ports for which we don’t already use the MPU, we could do this by making all memory accessible in the MPU, then overriding it by adding a memory region with a higher number that declares the memory for the currently-executing task as read-only when the CPU is in User mode. We would need to switch to the MPU-enabled configuration of FreeRTOS, and have that MPU region modified when the task switch occurs. We would also need to align each task object on a suitable (128-byte aligned) boundary for the MPU.
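To make the overlap idea above concrete, here is a small host-side model of how I understand the Armv7-M MPU resolves overlapping regions: the highest-numbered enabled region that contains the address decides the permissions. The type `mpu_region_model_t`, the function names and the addresses are all made up for illustration; this is a sketch of the rule, not FreeRTOS port code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of one MPU region (hypothetical type, not a FreeRTOS one). */
typedef struct {
    uint32_t base;          /* start address, aligned to size */
    uint32_t size;          /* power of two, at least 32 bytes */
    bool     enabled;
    bool     user_writable; /* may unprivileged code write here? */
} mpu_region_model_t;

/* Returns true if an unprivileged write to addr would be allowed.
 * Regions are scanned from the highest number down, so when regions
 * overlap the highest-numbered enabled region wins. */
bool user_write_allowed(const mpu_region_model_t *regions, size_t count,
                        uint32_t addr)
{
    for (size_t i = count; i-- > 0; ) {
        const mpu_region_model_t *r = &regions[i];
        /* Unsigned subtraction wraps to a large value when addr < base. */
        if (r->enabled && (addr - r->base) < r->size) {
            return r->user_writable;
        }
    }
    return false; /* no region matched: an unprivileged access faults */
}

/* Example layout (assumed addresses): region 0 opens a 64 KB RAM,
 * region 1 makes a 128-byte Task object at 0x20001000 read-only. */
bool demo_write_allowed(uint32_t addr)
{
    const mpu_region_model_t regions[] = {
        { 0x20000000u, 0x10000u, true, true  },
        { 0x20001000u, 128u,     true, false },
    };
    return user_write_allowed(regions, 2, addr);
}
```

With this layout a write to 0x2000107F (the last byte of the Task object) is rejected, while a write to 0x20001080 (the first byte of the stack that follows it) is allowed.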

I’ve never used the MPU-enabled ports of FreeRTOS before, so my question is: would this be straightforward to implement? Has anyone done it before? Does FreeRTOS switch out of user mode on every call that may need to write to the current Task object? Does this switch add much overhead?

I am aware that some more recent Arm Cortex processors have a stack limit register, but sadly we have to support older processors. Also, there doesn’t appear to be an equivalent of the Cortex M7 processor with a stack limit register.

I’ve researched the MPU support and it appears to me that a better solution would be to put all the Task objects in protected memory and the stacks in another section of protected memory, since FreeRTOS has the ability to allow a task access to its own stack when it is swapped in. However, a limitation of this approach is that each stack must be a power of 2 in size and alignment. This means that if, for example, I determine that a 2 KB stack is slightly too small for a task, I am forced to use a 4 KB stack. Some of our firmware configurations are too tight on RAM usage to be this wasteful - we try to optimise stack sizes to within 200 bytes. So I may need to modify FreeRTOS to allow the stack to be described by a small number (perhaps a maximum of three) of memory descriptors instead of just one.
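As a quick illustration of the cost described above, this sketch rounds a stack requirement up to the next power of two and reports the padding that a single MPU region would force. The function names are invented for the example, and the 32-byte minimum is the smallest Armv7-M region size.

```c
#include <stdint.h>

/* Smallest power-of-two region (at least 32 bytes) that holds `bytes`. */
uint32_t mpu_region_size_for(uint32_t bytes)
{
    uint32_t size = 32u;
    /* Stop at 2 GB so the shift cannot overflow to zero. */
    while (size < bytes && size < 0x80000000u) {
        size <<= 1;
    }
    return size;
}

/* RAM lost to rounding when one region must cover the whole stack.
 * Assumes bytes <= 2 GB. */
uint32_t mpu_padding_for(uint32_t bytes)
{
    return mpu_region_size_for(bytes) - bytes;
}
```

For a stack that needs 2 KB plus 200 bytes (2248 bytes), `mpu_region_size_for` returns 4096, so 1848 bytes are padding.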

I have somewhat of a similar problem, in that I need to adjust task sizes to keep the program from crashing. My memory situation is opposite, in that I have sufficient memory because I added a QSPI memory to all of my designs. (I am only using L5 and U5 processors, which have that interface.) My programming is in C++, and I have a callable routine that sets up all the drivers and queues the system uses, then creates the main application task (which creates application specific chip drivers and various tasks).

This won’t solve your problem, but it can give you another handle on what’s going on.

What I did was to add a statistics option, which is conditionally enabled. It records the FreeRTOS heap statistics every time it’s called, along with the current task name and the filename and line number of the call. It stores to the memory segment used for trace data (which I don’t use at this time).
The call looks like:

		#ifdef _STATISTICS
			statistics_tag(__FILE__, __LINE__, (char*) "Beginning of APPLICATION CREATE");
		#endif

while the routine itself is:

struct statistics_data STATISTICS[_MAX_STATISTICS] = {0};
//struct statistics_data __attribute__((section (".trace_data"))) STATISTICS[_MAX_STATISTICS] = {0};

#include "../COMMON_CPP/CPP_LINK.hpp"

//#ifdef _XHEAP
//	#include "../HARDWARE_CPP\xheap.hpp"
//	Xheap*				xheap;
//#endif

#ifdef __cplusplus
	extern "C"
	{
#endif

void statistics_tag(const void* filename, unsigned int linenumber, char* comment)
{
	int 				i=0;
//	BlockLink_t* 		pxBlock;
	HeapStats_t			pxHeapStats;
	TaskHandle_t		current_task;

	// Check the index first so STATISTICS[i] is never read past the end of the array.
	while ((i < _MAX_STATISTICS) && (STATISTICS[i].filename != NULL))
	{
		i++;
	}
	if (i < _MAX_STATISTICS)
	{
		//strncpy(STATISTICS[i].filename,(char*)filename,49);
		STATISTICS[i].filename = (char*)filename;
		STATISTICS[i].line_number = linenumber;
//		strncpy(STATISTICS[i].comment,comment,49);
		current_task =  xTaskGetCurrentTaskHandle();
		if (current_task != 0)
		{
			strncpy(STATISTICS[i].active_task,pcTaskGetName(current_task ),49);
		}
		else
		{
			strcpy(STATISTICS[i].active_task,(char*) "NO ACTIVE TASK");
		}
//		#ifdef _XHEAP
//			STATISTICS[i].xxheapsize = xheap->get_free();
//		#endif
		STATISTICS[i].comment = comment;
		vPortGetHeapStats(&pxHeapStats);
//		taskENTER_CRITICAL();
//		{
			STATISTICS[i].FreeRTOS_heap_free = pxHeapStats.xAvailableHeapSpaceInBytes;
			STATISTICS[i].FreeRTOS_heap_lowest = pxHeapStats.xMinimumEverFreeBytesRemaining;
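			// Allocated = total - free. Assumes configTOTAL_HEAP_SIZE describes the whole heap (heap_4 style); adjust for heap_5 regions.
			STATISTICS[i].FreeRTOS_heap_allocated = configTOTAL_HEAP_SIZE - pxHeapStats.xAvailableHeapSpaceInBytes;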
			if (i == 0)
			{
				STATISTICS[i].delta_from_last = 0;
			}
			else
			{
				STATISTICS[i].delta_from_last = STATISTICS[i].FreeRTOS_heap_free - STATISTICS[i-1].FreeRTOS_heap_free;
//				#ifdef _XHEAP
//				STATISTICS[i].xxheapdelta= STATISTICS[i].xxheapsize - STATISTICS[i-1].xxheapsize;
//				#endif;
			}
//		}
//		taskEXIT_CRITICAL();
	}
}


#ifdef __cplusplus
}
#endif														// END FREERTOS LOOP


and the data structure is:

	struct statistics_data
	{
		char*			filename;
		int				line_number;
		char*			comment;
		char			active_task[50];
		int				FreeRTOS_heap_free;
		int				FreeRTOS_heap_lowest;
		int				FreeRTOS_heap_allocated;
		int				delta_from_last;
		int				xxheapsize;
		int				xxheapdelta;

	};


xheap is a C++ routine that has a memory manager allowing a separate memory space in extended memory; it uses a first-fit algorithm. Since it’s C++, calling it from C routines is a bit messy, so I disabled that part. I’d need to duplicate the memory statistics in C-accessible variables; I just didn’t get to it.

What I did, program-structure-wise, is run my task creation routines outside of the default task. You don’t really need them inside it, and it was just chewing up memory. (This is the default “see what we’re doing for you” task generated by CubeMXIDE. It’s automatically created, but I use a minimal stack and have it do a 10 second delay loop.) Task creation is done in the section of main labeled “FreeRTOS queues”, since there’s no labeled place to put it, but it does need to be after the OS kernel init and before the call to run the tasks.

This at least gives me an idea of what is being asked of the FreeRTOS heap.
I got some interesting answers about task high water marks.

I use an external RTOS heap, put some programs in xheap memory, and am still messing with the process.

Hope this helps a bit.

Yes, the power-of-2 size and alignment is needed as it is a hardware requirement of the MPU.

You can also explore a static MPU configuration and use the MPU sub-region disable feature, which allows you to divide an MPU region into 8 equal parts and enable/disable them individually. Consider having a big memory region which satisfies the hardware requirements for size and start address, and use 1/8th of it for each task stack:

+--------+--------+--------+--------+--------+--------+--------+--------+
| Stack1 | Stack2 | Stack3 | Stack4 | Stack5 | Stack6 | Stack7 | Stack8 |
+--------+--------+--------+--------+--------+--------+--------+--------+

<----------------------------------------------------------------------->
                            MPU Region

You’ll need to program the MPU on each context switch to enable the stack for the corresponding task. You can use the traceTASK_SWITCHED_IN hook for that.
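A rough sketch of the value such a hook might write is below. It builds an Armv7-M MPU_RASR word that enables only one of the eight sub-regions (a set SRD bit disables that sub-region, and sub-regions need a region of at least 256 bytes). The macro and function names are made up for this example, and the TEX/S/C/B memory attribute bits are left at zero for brevity; real code would have to set them for the RAM in use and then write the result to the MPU registers.

```c
#include <stdint.h>

#define RASR_ENABLE      (1u << 0)
#define RASR_SIZE_SHIFT  1u
#define RASR_SRD_SHIFT   8u
#define RASR_AP_RW_ALL   (3u << 24) /* AP = 0b011: read/write for everyone */
#define RASR_XN          (1u << 28) /* stacks should not be executable */

/* SIZE field encoding: region size = 2^(SIZE + 1).
 * Assumes region_bytes is a power of two between 32 bytes and 2 GB. */
uint32_t rasr_size_field(uint32_t region_bytes)
{
    uint32_t log2 = 5u;
    while (log2 < 31u && (1u << log2) < region_bytes) {
        log2++;
    }
    return log2 - 1u;
}

/* SRD mask that leaves only sub-region `slot` (0..7) enabled. */
uint32_t srd_only_slot(unsigned slot)
{
    return ~(1u << slot) & 0xFFu;
}

/* RASR value for the shared stack region with one task's slot enabled. */
uint32_t stack_pool_rasr(uint32_t region_bytes, unsigned slot)
{
    return RASR_XN
         | RASR_AP_RW_ALL
         | (srd_only_slot(slot) << RASR_SRD_SHIFT)
         | (rasr_size_field(region_bytes) << RASR_SIZE_SHIFT)
         | RASR_ENABLE;
}
```

For an 8 KB region (eight 1 KB stacks) and the task using Stack3 (slot 2), the SRD mask is 0xFB and the resulting RASR value is 0x1300FB19.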

@maydn, thanks for replying.

I already have a stack reporter that reports the stack low water mark, by prefilling the stack with a pattern when it is created. This is what I use to establish how much stack each task needs. However, it isn’t foolproof because different machine configurations may result in slightly different stack requirements, depending on the pattern of nested function calls made.
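For readers who have not seen the technique, a minimal host-side sketch of pattern-based low water mark reporting follows. It assumes a stack that grows downward (so the lowest addresses are used last) and an arbitrary fill byte; the names are invented for the example and this is not the reporter described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5u /* arbitrary pattern chosen for this sketch */

/* Prefill the whole stack area before the task first runs. */
void stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* Count bytes at the low end that still hold the pattern.
 * stack[0] is the stack limit; the task starts at stack[size - 1]. */
size_t stack_unused_bytes(const uint8_t *stack, size_t size)
{
    size_t n = 0;
    while (n < size && stack[n] == STACK_FILL_BYTE) {
        n++;
    }
    return n;
}
```

The result is a lower bound on the headroom seen so far, which is why it cannot prove that a different call pattern will not go deeper.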

@aggarg, thanks for replying.

“Yes, that is needed as it is a hardware requirement.”

Isn’t it rather a combination of the hardware requirement on MPU regions, and the decision made to use only one MPU region descriptor to describe the stack? For example, if up to 2 descriptors were used, then if a 2K stack was too small I could use a 3K stack on a 1K-aligned boundary, described by a 2K block followed by a 1K block or vice versa.
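The multi-descriptor idea can be sketched as a greedy split of a stack range into naturally aligned power-of-two blocks, each of which could be described by one MPU region. This is only an illustration under assumed names (`mpu_block_t`, `mpu_split`); FreeRTOS does not provide such a function, and the sketch ignores sub-regions and overlapping regions.

```c
#include <stddef.h>
#include <stdint.h>

#define MPU_MIN_REGION 32u /* smallest Armv7-M MPU region */

typedef struct {
    uint32_t base; /* aligned to size */
    uint32_t size; /* power of two */
} mpu_block_t;

/* Split [base, base + size) into aligned power-of-two blocks.
 * Returns the number of blocks written to out, or 0 if the range is
 * empty, not a multiple of 32 bytes, or needs more than max_blocks. */
size_t mpu_split(uint32_t base, uint32_t size,
                 mpu_block_t *out, size_t max_blocks)
{
    size_t n = 0;

    if (size == 0u || (base % MPU_MIN_REGION) != 0u ||
        (size % MPU_MIN_REGION) != 0u) {
        return 0;
    }
    while (size != 0u) {
        if (n == max_blocks) {
            return 0; /* would need more descriptors than allowed */
        }
        /* Grow the block while it still fits and base stays aligned. */
        uint32_t block = MPU_MIN_REGION;
        while (block < 0x80000000u && (block << 1) <= size &&
               (base % (block << 1)) == 0u) {
            block <<= 1;
        }
        out[n].base = base;
        out[n].size = block;
        n++;
        base += block;
        size -= block;
    }
    return n;
}
```

A 3K stack at a 1K-aligned address such as 0x20000400 comes out as a 1K block followed by a 2K block, and the same stack at 0x20000000 comes out as 2K followed by 1K. A 2240-byte stack (2048 + 128 + 64) needs three blocks, so it is rejected when only two descriptors are allowed.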

Thanks for the suggestion to use sub-regions, I hadn’t considered that.

I’ve realised that if I switch to the MPU build, this will have major implications for some of our code. We often disable interrupts for very short periods when accessing small non-atomic data structures shared by two or more tasks, or shared by a task and an ISR (in the latter case we sometimes just change the BASEPRI register to shut out only some interrupts). We read the SysTick timer in order to implement delays of just a few hundred nanoseconds or a few microseconds. On the ARM Cortex M0+ processor we use, the standard integer division implementation in the library disables interrupts (because the hardware divide unit can’t be interrupted) and we also need to disable interrupts to implement members of std::atomic. We won’t be able to do this if we use the MPU build and run tasks in user mode.

You are right. I was just talking about the size and alignment requirement of one MPU region.

You should still be able to do that by making all the tasks privileged. You’ll need to ensure that the memory right next to the task stack is not accessible to the task.

@aggarg, thanks. I had mistakenly assumed that a privileged task had access to all memory.

You are right that a privileged task has access to all the memory as we enable background access (i.e. privileged code can access memory areas not covered by any MPU region). You can limit access to the memory right after the task stack explicitly by marking it read only using a task configurable MPU region.
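To show why an explicit read-only region still stops a privileged task, here is a small model of the Armv7-M MPU access permission (AP) field as I understand it from the architecture manual. The enum and function names are invented for this sketch, and the background-region behaviour is reduced to a single boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Armv7-M MPU AP field encodings (0b100 is reserved). */
enum {
    AP_NO_ACCESS       = 0u,
    AP_PRIV_RW         = 1u, /* privileged read/write, no user access */
    AP_PRIV_RW_USER_RO = 2u,
    AP_FULL_ACCESS     = 3u,
    AP_PRIV_RO         = 5u, /* privileged read-only, no user access */
    AP_READ_ONLY       = 6u  /* read-only for privileged and user code */
};

/* Does a region with this AP value permit the access? */
bool ap_allows(uint32_t ap, bool privileged, bool is_write)
{
    switch (ap) {
    case AP_PRIV_RW:         return privileged;
    case AP_PRIV_RW_USER_RO: return privileged || !is_write;
    case AP_FULL_ACCESS:     return true;
    case AP_PRIV_RO:         return privileged && !is_write;
    case AP_READ_ONLY:       /* fall through */
    case 7u:                 return !is_write; /* 0b111 is also read-only */
    default:                 return false;     /* no access or reserved */
    }
}

/* Privileged write check: inside the read-only guard region the AP
 * field decides; elsewhere the background map (assumed enabled) lets
 * privileged code through. */
bool privileged_write_allowed(bool in_guard_region)
{
    if (in_guard_region) {
        return ap_allows(AP_READ_ONLY, true, true);
    }
    return true;
}
```

In this model a privileged write outside the guard succeeds through the background map, while the same write inside the guard region is refused because the region's own permissions take precedence.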