Task switching timings

utemkin wrote on Thursday, May 17, 2018:

Hi!

I’m new to FreeRTOS and am trying to understand if I did something wrong…

I have an STM32F103 MCU at 72 MHz. The compiler is GCC:
arm-none-eabi-gcc.exe (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]
For timings I use DWT->CYCCNT, which gives the count of HCLK clocks since MCU reset. It gives perfectly repeatable results.
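
A minimal sketch of this kind of measurement, assuming a Cortex-M3 with the DWT unit present. The register addresses are the architectural ones that CMSIS exposes as CoreDebug->DEMCR, DWT->CTRL and DWT->CYCCNT. The helper names cyccnt_enable and cycles_elapsed are made up for illustration and are not part of CMSIS or FreeRTOS.

```c
#include <stdint.h>

/* Cortex-M3 debug registers, written out by address so the sketch does not
 * depend on a vendor header. */
#define DEMCR_REG       (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL_REG    (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT_REG  (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* starts the cycle counter */

/* Hypothetical helper: call once on the target before the first reading. */
static inline void cyccnt_enable(void)
{
    DEMCR_REG |= DEMCR_TRCENA;
    DWT_CYCCNT_REG = 0u;
    DWT_CTRL_REG |= DWT_CTRL_CYCCNTENA;
}

/* Hypothetical helper: cycles between two CYCCNT readings. Unsigned 32-bit
 * subtraction stays correct across a single counter wraparound. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target the two readings would come from DWT_CYCCNT_REG. At 72 MHz the 32-bit counter wraps roughly once a minute, so intervals longer than that cannot be measured this way.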

I have two tasks. The first is the idle task created by FreeRTOS with the lowest priority. The second is a normal priority task (my task) created before the scheduler is started. So after the scheduler starts, my task runs first because it has the higher priority, and the idle task has no chance to run until my task blocks.

I read DWT->CYCCNT in vApplicationIdleHook and then just hang there in an endless loop.

I call xSemaphoreTake on a binary semaphore that has not been given (so it blocks my task and forces a context switch to the idle task). I also read DWT->CYCCNT just before xSemaphoreTake.

The two readings differ by 1130 CPU clocks.

As I understand, the following happens:

  1. xSemaphoreTake checks the state of the semaphore and finds it is not given
  2. the ID of my task is appended to the empty list inside the semaphore
  3. the scheduler finds the next task to run (it’s the idle task)
  4. the scheduler sets the PendSV bit
  5. the PendSV handler switches the task context

Is it normal for FreeRTOS to spend that much time doing what I listed? Am I missing something?

BTW I measured how much portYIELD() takes to switch back to the same task (as no other task is scheduled). It is 170 clocks.

rtel wrote on Thursday, May 17, 2018:

BTW I measured how much portYIELD() takes to switch back to the same
task (as no other task is scheduled). It is 170 clocks.

That sounds about right. The numbers here (FreeRTOS FAQ relating to
FreeRTOS memory management and usage) are for the context switch only
and, although shorter than your measured time (84 clocks), that will
probably just be due to different configurations.

Regarding the semaphore times. The semaphores are relatively heavy
weight objects as they are feature rich - you can have any number of
tasks blocked to give, any number blocked to take, and the blocked tasks
are in priority order, etc. The yield, which you listed separately,
occurs within that too. In most cases a direct to task notification can
be used for a much lighter weight and faster signalling mechanism.
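
To illustrate one piece of the extra work mentioned above, here is a simplified plain-C model of keeping blocked tasks in priority order. It is not the FreeRTOS list code. The types waiter_t and wait_list_t and the function wait_list_insert are hypothetical names, and a fixed-size array stands in for the kernel's linked lists.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 8u

/* Simplified stand-in for a task blocked on a semaphore. */
typedef struct {
    uint8_t task_id;
    uint8_t priority;   /* larger number = higher priority */
} waiter_t;

typedef struct {
    waiter_t items[MAX_WAITERS];
    size_t   count;
} wait_list_t;

/* Insert so that the highest priority waiter is at index 0 and waiters of
 * equal priority stay in arrival order. Returns 0 on success, -1 if full. */
static int wait_list_insert(wait_list_t *list, waiter_t w)
{
    if (list->count >= MAX_WAITERS) {
        return -1;
    }
    size_t pos = list->count;
    /* Shift lower priority waiters one slot towards the tail. */
    while (pos > 0 && list->items[pos - 1].priority < w.priority) {
        list->items[pos] = list->items[pos - 1];
        pos--;
    }
    list->items[pos] = w;
    list->count++;
    return 0;
}
```

A task notification has a single fixed target task, so it needs no such ordered list, which is one reason it can be cheaper.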

utemkin wrote on Friday, May 18, 2018:

That sounds about right. The numbers here (FreeRTOS FAQ relating to
FreeRTOS memory management and usage) are for the context switch only
and, although shorter than your measured time (84 clocks), that will
probably just be due to different configurations.

I agree. 170 clocks seems reasonable. To the 84 clocks one should add:

  1. set PENDSV bit ~5 clocks
  2. enter/leave interrupt 12+12 clocks
  3. vTaskSwitchContext execution
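
The budget above can be spelled out as a small sketch. The figures are the ones quoted in this thread. Attributing the whole remainder to vTaskSwitchContext is an assumption, since the remainder also absorbs any measurement overhead. The names are hypothetical.

```c
#include <stdint.h>

enum {
    MEASURED_YIELD_CLOCKS = 170, /* portYIELD() back to the same task */
    FAQ_SWITCH_CLOCKS     = 84,  /* context switch figure from the FAQ */
    SET_PENDSV_CLOCKS     = 5,   /* approximate, as estimated above */
    EXC_ENTRY_CLOCKS      = 12,
    EXC_EXIT_CLOCKS       = 12
};

/* Clocks left over for vTaskSwitchContext and anything else not listed. */
static inline int32_t yield_remainder_clocks(void)
{
    return MEASURED_YIELD_CLOCKS
         - (FAQ_SWITCH_CLOCKS + SET_PENDSV_CLOCKS
            + EXC_ENTRY_CLOCKS + EXC_EXIT_CLOCKS);
}
```

With these numbers the listed items account for 113 clocks, leaving 57.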

About direct task notifications:
I replaced xSemaphoreTake with ulTaskNotifyTake(pdTRUE, portMAX_DELAY).
Now it takes 432 clocks. This also seems reasonable.

But still… What is the difference between semaphore and notification in my case?

From what I already described

  1. xSemaphoreTake checks the state of the semaphore and finds it is not given
  2. the ID of my task is appended to the empty list inside the semaphore
  3. the scheduler finds the next task to run (it’s the idle task)
  4. the scheduler sets the PendSV bit
  5. the PendSV handler switches the task context

only step 2 differs. And that takes 1130 - 432 ≈ 700 clocks!

BTW the code is compiled with the -O2 option and
configASSERT undefined
configUSE_PORT_OPTIMISED_TASK_SELECTION 1
trace and statistic facilities turned off

rtel wrote on Friday, May 18, 2018:

The semaphore includes much more logic as it can be used by multiple
tasks at once. For example, consider the case where a task is unblocked
because a semaphore becomes available, but before the task can run, a
higher priority task takes the semaphore again. Now, when the unblocked
task does get a chance to run, it finds the semaphore unavailable, has
to recalculate its block time to take into account the length of time
that passed since the function was called, and then re-block. I know
that is not the path your code is taking, but you still need to go
through the if()/else()/loop calls to determine that.
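
The "recalculate its block time" step can be modelled in plain C as below. FreeRTOS does this job with xTaskCheckForTimeOut(). The code here is a simplified model and not the kernel source: block_timeout_t and timeout_update are hypothetical names, and the special handling of portMAX_DELAY is left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tick_t;

typedef struct {
    tick_t entered_at;  /* tick count when the wait (re)started */
    tick_t ticks_left;  /* remaining block time */
} block_timeout_t;

/* Called when a woken task finds the semaphore unavailable again.
 * Returns true if the block time has expired. Otherwise it charges the
 * elapsed ticks against the remaining time so the task can re-block for
 * only what is left. */
static bool timeout_update(block_timeout_t *t, tick_t now)
{
    tick_t elapsed = now - t->entered_at;  /* unsigned: wrap-safe */

    if (elapsed >= t->ticks_left) {
        t->ticks_left = 0;
        return true;
    }
    t->ticks_left -= elapsed;
    t->entered_at = now;
    return false;
}
```

Even when the semaphore is simply unavailable on the first check, the take path still has to carry this state and pass through the loop that would use it.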

utemkin wrote on Friday, May 18, 2018:

You described a really strange case. I would say that if ownership of the semaphore’s ‘coin’ has already been assigned to the low priority task, it should not be taken back. This mechanism is very much like the Cortex-M late-arriving feature, but with a big difference. Late-arriving is just an optimization, which prevents the MCU from doing unnecessary work. But your mechanism changes the sequence of events. I understand it tries to do as much as possible to let the higher priority task run, but it still seems strange to me.

Anyway, thanks for your explanation and suggestion. I’ll try to implement synchronization by direct task notifications.

richard_damon wrote on Friday, May 18, 2018:

The way to think about it is that when the semaphore is given, the highest priority task waiting on the semaphore is woken because the semaphore is AVAILABLE, not because it has been given the semaphore. The task still needs to run to actually take the semaphore. This means the take always happens in the context of the task requesting the semaphore (which might simplify some code, especially tracing code). It also means that if the task being woken isn’t the highest priority ready task, some other higher priority task has the chance to take the semaphore before the originally woken task.

The giving task doesn’t give the coin to the other task, it just gives it back to the semaphore, so it is ready to be taken.
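
A tiny plain-C model of this "available, not handed over" behaviour. It is only an illustration: model_sem_t, model_give and model_try_take are hypothetical names, and real blocking, waking and scheduling are left out.

```c
#include <stdbool.h>

/* Model of a binary semaphore: it only records whether the coin is there. */
typedef struct {
    bool available;
} model_sem_t;

/* Give puts the coin back into the semaphore. It does not assign the coin
 * to any particular waiting task. */
static void model_give(model_sem_t *s)
{
    s->available = true;
}

/* Take runs in the context of the calling task and succeeds only if the
 * coin is still there at that moment. */
static bool model_try_take(model_sem_t *s)
{
    if (!s->available) {
        return false;
    }
    s->available = false;
    return true;
}
```

In this model, if a low priority task is woken by a give but a higher priority task calls model_try_take first, the low priority task's own attempt returns false and it would have to block again.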

This is a very different context from the interrupt case you mentioned. In the interrupt case, the low priority interrupt was already of higher priority than the execution context (or it wouldn’t occur), while in the semaphore give case, the give was likely waking a task of lower priority than the giver (otherwise the kernel would have switched to it immediately and it would have taken the semaphore).

utemkin wrote on Friday, May 18, 2018:

Richard Damon, I understand the concept. I’m just wondering why it is implemented this way in FreeRTOS. Handing the coin to the task in kernel mode would slow down the kernel part by several CPU clocks, but would save the user mode part several hundred clocks. Why was this trade-off chosen?

And I understand that a semaphore is unnecessarily complex for the interrupt case. As I understand it, the only alternative is direct task notifications, but they have negative consequences. They force you to design the waiting part of the code so that different components are tightly bound together. This is very much like the socket reactor concept for Berkeley sockets, where the key part is the code around a single place with select().

rtel wrote on Friday, May 18, 2018:

Richard, I understand the concept. I’m just wondering why it is implemented
this way in FreeRTOS. Handing the coin to the task in kernel mode would slow
down the kernel part by several CPU clocks, but would save the user mode part
several hundred clocks. Why was this trade-off chosen?

…because it is the most true to the scheduling policy and reduces the
risk of unbounded priority inversion.

richard_damon wrote on Friday, May 18, 2018:

My feeling on this is that the current system is likely the simplest. The sequences that happen when you give a semaphore and when you take a semaphore are always uniform. The give operation doesn’t need to do different things to the semaphore based on the presence of a task waiting on it. (It does do some additional steps for the highest priority waiting task if there is one, but its actions on the semaphore are uniform.) The semaphore is always acquired by the same sequence of code. This makes it easier to ‘prove’ that the code is correct, and since one variant of the code (SafeRTOS) has gone through certification, being able to do that is useful.

You also talk about ‘kernel mode’ and ‘user mode’, but there is no such distinction here. Most uses of FreeRTOS are on simple processors which don’t have any such distinction. Even when run on a processor with protection, the entire semaphore code would run in protected mode.

utemkin wrote on Monday, May 21, 2018:

My feeling on this is that the current system is likely the simplest.

Got it. Thanks. Reliability things are expensive.

You also talk about ‘kernel mode’ and ‘user mode’, but there is no distinction.

I think I mixed up the terms… By ‘kernel mode’ I meant the mode in which all OS objects are locked against modification and interrupts are disabled. I’ve also read somewhere on freertos.org that there is a policy of not having loops of unpredictable length in kernel mode, whereas they are acceptable in user mode, when interrupts are enabled.

I’ve run another experiment with direct task notifications.

The scenario is:

  1. Start the hardware operation
  2. Take timestamp 0. Call xTaskNotifyWait
  3. Take timestamp 1 upon entry to the idle task
  4. Get the interrupt from the hardware. Take timestamp 2. Call xTaskGenericNotifyFromISR
  5. Take timestamp 3 upon return from xTaskGenericNotifyFromISR
  6. Take timestamp 4 upon return from xTaskNotifyWait

Results:

  1. timestamp 0 to timestamp 1: 459 clocks
  2. timestamp 2 to timestamp 3: 238 clocks
  3. timestamp 3 to timestamp 4: 275 clocks

So this sums to 972 clocks and involves:

  1. Marking current task as waiting for notification
  2. Finding the highest priority non-blocked task. Set the PendSV bit
  3. Context switch (170 clocks)
  4. Compare the notified task’s priority to the current task’s. Set the PendSV bit
  5. Context switch (170 clocks)

So there’s still something I don’t understand.

  1. Why do steps 1 + 2 take 289 clocks? Especially considering the port-optimized task selection, which finds a non-blocked task very fast?
  2. Why does step 4 take 238 clocks? The task to wake is already known…
  3. And why are about 100 clocks added to the context switch here?
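
For reference, the figures behind these questions can be derived from the measurements as in the sketch below. It assumes the 170-clock portYIELD() measurement is the right baseline for a context switch. All names are hypothetical.

```c
#include <stdint.h>

enum {
    T0_TO_T1_CLOCKS       = 459, /* xTaskNotifyWait call to idle task entry */
    T2_TO_T3_CLOCKS       = 238, /* inside xTaskGenericNotifyFromISR */
    T3_TO_T4_CLOCKS       = 275, /* ISR return to xTaskNotifyWait return */
    CONTEXT_SWITCH_CLOCKS = 170  /* measured earlier with portYIELD() */
};

/* Sum of the three measured intervals. */
static inline int32_t total_clocks(void)
{
    return T0_TO_T1_CLOCKS + T2_TO_T3_CLOCKS + T3_TO_T4_CLOCKS;
}

/* Steps 1 + 2: the blocking path minus one context switch. */
static inline int32_t block_path_clocks(void)
{
    return T0_TO_T1_CLOCKS - CONTEXT_SWITCH_CLOCKS;
}

/* Extra time on the way back into the task, beyond one context switch. */
static inline int32_t resume_extra_clocks(void)
{
    return T3_TO_T4_CLOCKS - CONTEXT_SWITCH_CLOCKS;
}
```

These give 972, 289 and 105 clocks respectively, matching the numbers quoted above.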

Is there any other mechanism in FreeRTOS for the interrupt scenario with less overhead?

BTW I’m stuck with FreeRTOS 9.0.0 because it is the latest in STM32CubeMX. Are there any performance improvements in v10?