Performance issues using newer FreeRTOS V10.4.6, V10.5.0 with TMS570LC4357

Hi,
We are using FreeRTOS on a TMS570LC4357. This processor is Arm Cortex-R5 based. As there is no FreeRTOS port for it, we are writing our own, based on the R4 port and the code provided by HALCoGen. As HALCoGen only supports older FreeRTOS versions, we had to adapt it ourselves.
Currently, we are upgrading from version 10.4.5 to the latest version in steps. While doing so, we noticed a significant performance loss when measuring the idle time.

Version   Idle %
10.4.5    44%
10.4.6    40%
10.5.0    33%
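
To put the idle figures in terms of CPU load: 44% idle means 56% busy, and 33% idle means 67% busy. A minimal sketch of that arithmetic follows; the function names are hypothetical and only serve the illustration, and it assumes the newer build is never less busy than the baseline.

```c
/* CPU load in percent, derived from the measured idle percentage. */
unsigned busy_percent(unsigned idle_percent)
{
    return 100u - idle_percent;
}

/* Relative growth of busy time from a baseline build to a newer build,
 * in tenths of a percent (integer division truncates).
 * Assumes idle_new <= idle_old, i.e. the newer build is not less busy. */
unsigned busy_growth_permille(unsigned idle_old, unsigned idle_new)
{
    unsigned busy_old = busy_percent(idle_old);
    unsigned busy_new = busy_percent(idle_new);
    return ((busy_new - busy_old) * 1000u) / busy_old;
}
```

With the numbers from the table, going from V10.4.5 to V10.5.0 is an increase of 11 percentage points in load, or roughly a fifth more CPU time for the same work.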

While tracking down this problem, we noticed a change in the portasm.asm and portmacro.h files.

FreeRTOS V10.4.5 portasm.asm:

; start: required for Cortex-R5 MPU port - generated by TI HALCoGen - see src/os/freertos/README.ti-halcogen.md for details
;-------------------------------------------------------------------------------
        .def ulPortCountLeadingZeros
        .asmfunc
ulPortCountLeadingZeros
        CLZ     R0, R0
        BX      LR
        .endasmfunc

FreeRTOS V10.4.5 portmacro.h:

/* Generic helper function. */
unsigned long ulPortCountLeadingZeros( unsigned long ulBitmap );
#define portGET_HIGHEST_PRIORITY( uxTopPriority, uxReadyPriorities ) uxTopPriority = ( 31 - ulPortCountLeadingZeros( ( uxReadyPriorities ) ) )

FreeRTOS V10.4.6 portasm.asm:

; Function removed

FreeRTOS V10.4.6 portmacro.h:

#define portGET_HIGHEST_PRIORITY( uxTopPriority, uxReadyPriorities ) uxTopPriority = ( 31 - __clz( ( uxReadyPriorities ) ) )
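
Both variants of the macro compute the same thing: the index of the highest set bit in the ready-priority bitmap. A host-side sketch of that calculation is shown here; it assumes GCC or Clang and uses `__builtin_clz` as a stand-in for the CLZ instruction, and the function names are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the CLZ instruction. __builtin_clz (GCC/Clang builtin) is
 * undefined for 0, so the zero case is handled explicitly; CLZ returns 32. */
uint32_t count_leading_zeros(uint32_t bitmap)
{
    return (bitmap == 0u) ? 32u : (uint32_t)__builtin_clz(bitmap);
}

/* Index of the highest set bit, i.e. the highest ready priority.
 * Only meaningful for a non-zero bitmap; the kernel always has at least
 * the idle task ready, so bit 0 is set in practice. */
uint32_t highest_ready_priority(uint32_t ready_priorities)
{
    return 31u - count_leading_zeros(ready_priorities);
}
```

For example, a bitmap of 0x5 (priorities 0 and 2 ready) yields priority 2.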

When we just put back the assembler code without calling it, the performance returns to the V10.4.5 value. However, with the same change integrated in V10.5.0, the performance is still worse. We are using the TI-CGT 20.2.7 compiler, by default with -O3 optimization. With compiler optimization disabled (-O0), the difference between V10.4.5 and V10.4.6 disappears, but the problem with V10.5.0 remains.
In V10.5.0, FreeRTOS introduces an if statement in mpu_wrappers.c to check whether we are in privileged mode or not.
Previously, this check was inside the assembler code of swiRaisePrivilege, which made it more efficient. Has someone else encountered the same problems, or does anyone know the reason for the change in mpu_wrappers.c?
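
To make the structural difference concrete, here is a simplified host-side model of the two wrapper shapes. It is not the kernel source: every `sim_` name and both wrapper functions are hypothetical, and privilege is modeled as a plain flag. It only shows where the check sits and does not explain the timing by itself.

```c
#include <stdbool.h>

/* Simulated CPU state and call counter; hypothetical, for illustration only. */
bool sim_privileged = false;
unsigned sim_raise_calls = 0;

/* Models a raise-privilege service that checks the current mode itself and
 * returns the previous state (the check lives inside the routine). */
bool sim_raise_privilege(void)
{
    bool was_privileged = sim_privileged;
    sim_raise_calls++;
    sim_privileged = true;
    return was_privileged;
}

void sim_reset_privilege(void)
{
    sim_privileged = false;
}

/* Placeholder for the wrapped kernel API. */
int sim_kernel_api(int x)
{
    return x + 1;
}

/* Older shape: always enter the raise routine, restore only if needed. */
int wrapper_check_in_raise(int x)
{
    bool was_privileged = sim_raise_privilege();
    int result = sim_kernel_api(x);
    if (!was_privileged) {
        sim_reset_privilege();
    }
    return result;
}

/* Newer shape: the mode check is an explicit C branch in the wrapper. */
int wrapper_check_in_c(int x)
{
    int result;
    if (!sim_privileged) {
        (void)sim_raise_privilege();
        result = sim_kernel_api(x);
        sim_reset_privilege();
    } else {
        result = sim_kernel_api(x);
    }
    return result;
}
```

In this model, the newer shape skips the raise routine entirely when the caller is already privileged, but it duplicates the API call across two branches.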

Kind regards and thank you,
Sven

Several changes have been made over time in response to security bug reports. The one you are pointing to is likely one of those. Is this check the reason for the performance drop?

Is my understanding correct that there is a 4% performance gap when you inline the clz instruction instead of making a function call?

Yes. It cannot explain the performance loss fully, but it accounts for a significant part of it.

Also yes. This happens when compiler optimization is set to -O3. With -O0 there is not much of a difference.

Just to confirm: if you make this change, is the performance closer to the original? Can you also share the change?

Yes, then the idle task gets 38% of the execution time.

Would you please share the changes you made?

This is the diff that improved V10.5.0 from 33% idle to 38% idle:

The code you show seems to be a modified version of the mpu_wrappers.c we provide (FreeRTOS-Kernel/portable/Common/mpu_wrappers.c on the main branch of the FreeRTOS/FreeRTOS-Kernel repository on GitHub). This declaration should likely go in portmacro.h so that you do not need to modify mpu_wrappers.c. The other changes are in port files and those should be okay. Note that we do not provide this port, so you should inform TI about your updates so that they can incorporate them.

Thank you for sharing your investigations and solutions!

Alright. Thank you for your answer. So you don’t have an explanation for why it is getting slower? Or have you noticed this behavior on other ports?

Sorry, I should have been clearer in my response. I think the cause is the change from a macro to a function call. The changes you are making seem right to me; you’ll need to follow up with the vendor to get these changes into their port.

You mean the portRESET_PRIVILEGE() call, right? Yes, you are right. Changing that back improves the performance a bit. But the significant part of the performance loss still remains.

I think I am missing something here. I thought you made some changes which brought the performance back to the previous level. Is that understanding not correct?

No, the changes increased the idle time from 33% to 38%. But I need to get to 44%.

You might try disassembling the V10.5.0 build to double-check whether the compiler is inlining the CLZ instruction or falling back to an older instruction set (which might add a function call).

Another idea would be to run both builds under a profiler and look for major differences there.

Then we need to profile and see which part is consuming time.
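
If no profiler is at hand, one lightweight option is to bracket suspect code paths with reads of a free-running counter and accumulate the totals per section. The sketch below assumes such a counter exists; `cycle_reader_t`, `section_timer_t`, and the `fake_cycle_counter` stand-in are hypothetical names, and on the target the reader would be replaced by something like a read of the Cortex-R5 PMU cycle counter.

```c
#include <stdint.h>

/* Hypothetical hook that returns a free-running 32-bit counter value. */
typedef uint32_t (*cycle_reader_t)(void);

/* Accumulated cost of one instrumented section. */
typedef struct {
    uint32_t total_cycles;
    uint32_t calls;
    uint32_t start;
} section_timer_t;

void section_begin(section_timer_t *t, cycle_reader_t read_cycles)
{
    t->start = read_cycles();
}

void section_end(section_timer_t *t, cycle_reader_t read_cycles)
{
    /* Unsigned subtraction stays correct across a single counter wrap. */
    t->total_cycles += read_cycles() - t->start;
    t->calls++;
}

/* Host-side stand-in for the hardware counter: advances 10 ticks per read. */
static uint32_t fake_now = 0;

uint32_t fake_cycle_counter(void)
{
    fake_now += 10u;
    return fake_now;
}
```

Comparing `total_cycles / calls` for the same section in the V10.4.5 and V10.5.0 builds would show where the extra time goes. Note that the instrumentation itself costs cycles, so it is best kept to a few sections at a time.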