FreeRTOS vs Bare-metal comparison STM32

gogus wrote on Sunday, November 18, 2018:

Hello.
I am trying to compare the efficiency of a program written as bare-metal code with one based on FreeRTOS. While running tests I noticed that functions called from tasks execute faster than the same functions called outside a task (before the scheduler starts).
I use the STM32F429I-DISCO1 board with the STM32F429ZI MCU. The FreeRTOS configuration is generated by CubeMX. I use the arm-atollic-eabi-gcc compiler that ships with Atollic TrueSTUDIO. I turned off all optimization (-O0 flag). The FreeRTOS version is 9.0.0.
I measure a function’s execution time with the processor cycle counter, DWT->CYCCNT.

Here is my test code. I removed unnecessary code generated by CubeMX.

void foo()
{
	DWT->CYCCNT = 0;
	for(uint32_t i = 0 ;i<60000; i++){
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
	}
	printf("%lu\r\n", (unsigned long)DWT->CYCCNT);
}

void task(void* param){
	foo();
	while(1){};
}

int main(void)
{
	CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
	DWT->CTRL &= ~0x00000001;
	DWT->CTRL |= 0x00000001;
	DWT->CYCCNT = 0;

	foo();
	xTaskCreate(task, "task", 100, NULL, 3, NULL);
	vTaskStartScheduler();
	while(1){};
}

For foo() called from main() I get an execution time of 1440909 CPU cycles.
For foo() called from the task I get an execution time of 1333743 CPU cycles.

Do you have any idea why there is a difference?

PS. Sorry for my English.

rtel wrote on Sunday, November 18, 2018:

Can’t say why you notice the difference you do - but I will say what you
are doing is measuring the performance of the MCU and in no way shows
what the efficiency is of a program executing bare metal versus the same
program running in an RTOS. For example, if you are using bare metal
then you will probably be executing state machines that waste time
looking to see if a state has changed or not, or possibly polling inputs
when the input has not changed. Using the RTOS (any RTOS) you can
create a completely event driven system that enables the scheduler to
allocate CPU time to a task only when the task actually has useful work
to do (no state machines, no polling, etc.) so can get a LOT more work
done in the same number of CPU cycles - and squeeze a lot more
functionality onto the same sized MCU.

ldb wrote on Sunday, November 18, 2018:

At a guess, your printf does a malloc of its internal buffer the first time it’s called … AKA it has nothing to do with FreeRTOS or the tasks.

Try two calls to foo(), one after the other, at the start:

foo();
foo();

I am going to guess the first one is slow, and the second will match the one called from the task :)

gogus wrote on Sunday, November 18, 2018:

Thank you for the reply. The result is the same: the function called from the task executes faster, no matter how many times I call foo() outside the task ;/
To be sure that printf is not the source of the additional CPU cycles, I saved DWT->CYCCNT to a variable before calling printf.
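
Saving the counter before calling printf, as described above, might look like the sketch below. `cycles_elapsed` is a hypothetical helper, not part of CMSIS or FreeRTOS. It relies only on unsigned 32-bit subtraction wrapping modulo 2^32, so the result stays correct even if the counter overflows once between the two snapshots.

```c
#include <stdint.h>

/* Hypothetical helper: elapsed cycles between two snapshots of a
 * free-running 32-bit counter such as DWT->CYCCNT.
 * Unsigned subtraction wraps modulo 2^32, so a single counter
 * overflow between the snapshots does not corrupt the result. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* On the target the measurement would be bracketed like this:
 *
 *     uint32_t start = DWT->CYCCNT;
 *     ... code under test ...
 *     uint32_t end = DWT->CYCCNT;
 *     printf("%lu\r\n", (unsigned long)cycles_elapsed(start, end));
 *
 * so the cost of printf itself never falls inside the measured window.
 */
```

Taking a start snapshot instead of resetting DWT->CYCCNT to zero also avoids writing to the counter inside the function being timed.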

gogus wrote on Sunday, November 18, 2018:

Thank you for the quick reply :)
Of course, when I use polling, the CPU utilization will be much higher. My bare-metal program is event and interrupt driven, so it should be more efficient than the one with FreeRTOS, shouldn’t it? With an RTOS the CPU has more work to do, e.g. context switches.

The code included above is a simple example of the problem. Of course, the results are real for this example.

richarddamon wrote on Sunday, November 18, 2018:

As to why the function runs faster in a task than in main, I can think of a couple of reasons. One is that the FreeRTOS Start Scheduler function does some hardware initialization, and that might change the processor speed. It might have turned on a hardware cache, and the task stack might be in faster RAM than the stack used by main if your processor has external RAM.
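
One way to check the stack-placement idea is to look at the address of a local variable in main() and in the task and see which memory region each falls into. The sketch below is only an illustration: `mem_region_t` and `classify_address` are hypothetical names, and the address ranges are the coarse STM32F429 regions as assumed here, which should be checked against the reference manual and the linker script of the project.

```c
#include <stdint.h>

/* Coarse STM32F429 memory regions, as assumed for this sketch.
 * Verify the ranges against the reference manual before relying on them. */
typedef enum {
    REGION_CCM_RAM,        /* 0x10000000..0x1000FFFF, core-coupled RAM   */
    REGION_INTERNAL_SRAM,  /* 0x20000000..0x2002FFFF, main internal SRAM */
    REGION_EXTERNAL_SDRAM, /* 0xC0000000..0xDFFFFFFF, FMC SDRAM banks    */
    REGION_OTHER           /* flash, peripherals, anything else          */
} mem_region_t;

/* Hypothetical helper: map an address to one of the regions above. */
static mem_region_t classify_address(uint32_t addr)
{
    if (addr >= 0x10000000u && addr <= 0x1000FFFFu) {
        return REGION_CCM_RAM;
    }
    if (addr >= 0x20000000u && addr <= 0x2002FFFFu) {
        return REGION_INTERNAL_SRAM;
    }
    if (addr >= 0xC0000000u && addr <= 0xDFFFFFFFu) {
        return REGION_EXTERNAL_SDRAM;
    }
    return REGION_OTHER;
}

/* On the target, call this once from main() and once from the task:
 *
 *     volatile uint32_t marker = 0;
 *     mem_region_t where = classify_address((uint32_t)(uintptr_t)&marker);
 *
 * If the two calls report different regions, the two stacks live in
 * different memories and may have different access timing.
 */
```

If both stacks turn out to be in the same region, the difference has to come from somewhere else, such as clock or flash settings changed during start-up.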

As to which system will be more efficient, as with many things the answer will be “It Depends”. A home-grown bare-metal system may be slightly more efficient in switching between operations, but a context switch is actually fairly cheap, not much more than the cost of an ISR entry. The biggest difference will be that a system like FreeRTOS will tend to give better response times, as you will put less code in the ISR, and the tasks can be preemptive, rather than having to wait for the current operation to reach a decision point. The need for the bare-metal system to have frequent stopping points may make the RTOS-based code more efficient.

gogus wrote on Sunday, November 18, 2018:

Thank you. You are probably right, because when I test a single operation like ‘i++’ inside and outside the task the difference is about 1 CPU cycle, so it looks like the processor needs more time to fetch/store the variable.
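
As a rough cross-check, the totals from the first post can be put on a per-iteration basis: (1440909 - 1333743) / 60000 = 107166 / 60000, which is about 1.79 cycles per loop pass. The helper below is a hypothetical sketch of that arithmetic, not code from the thread. At -O0 the loop counter is kept in memory, so a penalty of a cycle or two per pass is at least consistent with slower loads and stores, though it does not prove that this is the cause.

```c
#include <stdint.h>

/* Hypothetical helper: average number of extra cycles per loop
 * iteration between a slower and a faster run of the same loop. */
static double extra_cycles_per_iteration(uint32_t slow_total,
                                         uint32_t fast_total,
                                         uint32_t iterations)
{
    return (double)(slow_total - fast_total) / (double)iterations;
}

/* With the figures reported in the thread:
 *     extra_cycles_per_iteration(1440909u, 1333743u, 60000u)
 * evaluates to 107166.0 / 60000.0, roughly 1.79 cycles per iteration.
 */
```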