memcpy error under GCC optimize level 3 in

hannes23 wrote on Wednesday, November 01, 2017:

the package 160919_FreeRTOS_Labs contains a file memcpy.c, which doesn’t use fpu registers.
But unfortunately when compiling this file using optimizelevel 3, there will be a compile error
in the routine memset().
The error happens only, when the size to copy is not quad-aligned, which is the case in
FreeRTOS_DHCP.c, line 848.

The problem is visible when stepping through the assemly code and leads to a crash by pushing registers
to the stack endlessly.
To avoid the crash one could use the word volatile at the following union:
union xPointer {
volatile uint8_t *u8;
uint16_t *u16;
uint32_t *u32;
uint32_t uint32;

But on the other hand, this makes the code probably less efficient.
I’d like the comment of an expert here :slight_smile:


heinbali01 wrote on Wednesday, November 01, 2017:

Hi Hannes, you say that you want to avoid that FPU registers are being used by the ( standard ) memcpy.
I would like to simulate this situation. Can you tell me which CPU you are using ( Cortext A9 ? ) and also what compiler and version?
We mostly tested with the latest versions of GCC and optimisation level -Os, optimise for size.

hannes23 wrote on Thursday, November 02, 2017:

Hello Hein,
the CPU is Cortex A9, GCC version is 6.2.1 from the tool-chain SDK-2017.2.
I have attached the disassembly file, i.e. what I see in the disassembly window of the debugger.
The error happens in the memset routine, when due to non-quad-aligned size the last bytes are to
be set. This is in line 153 to 164. The source and destination registers r2, r0 are compared and when
unequal a jump to start of routine is performed. There the pointer difference is compared to 7, then the register pushes are done and the PC jumps to the same location (line 160) and the story continues.
I hope this is a bit enlightening.


heinbali01 wrote on Thursday, November 02, 2017:

Thanks, I will study it and come back to it.

davidbrown wrote on Thursday, November 02, 2017:

The attachment you made is for memset, not memcpy.

There was a related thread not long ago, “memcpy in ISR”. You should probably read through it to see what you can learn:

In general, I would say that you should be using the compiler’s builtin memcpy. It will usually be faster than anything else, will correctly handle alignment (much more efficiently in most cases), and will be far smaller and faster when you have known small sizes. If you have a problem with it (such as the use of FPU registers), then you have a problem with your compiler setup and flags - the builtin memcpy is a symptom of the problem, but replacing the memcpy will not fix it except perhaps by luck. After all, the compiler can use the same registers in other code - including the custom memcpy implementation. (If the problem is a bug in the compiler, that’s another matter - but if that’s the case, please report it!).

I’d also question the choice of using -O3. In practice, it is quite rare for -O3 to give significantly faster code - but it will often give significantly bigger code, and depending on things like caches, this in itself will mean slower code. Of course, code should be correct regardless of optimisation options - failures at -O3 can be helpful in finding bugs in either the compiler or the user code.

heinbali01 wrote on Thursday, November 02, 2017:

Hi Hannes, thanks for reporting this problem.

I tried optimisation level 3 (-O3) on my Zynq. It shows the same problem.

I liked this implementation that uses a union of pointers, because the code can easily be read. But apparently -O3 doesn’t fully understand unions.

Just like David, I’m not a fan of using -O3. It may generate a lot of assembler code in order to save a few CPU cycles.

Solutions: either use another optimisation level (e.g. -Os), or give each task a so-called FPU context (see configUSE_TASK_FPU_SUPPORT), and use the standard memcpy()/memset().

However the price is high: for each task switch 260 bytes must be stored and restored on the stack.

hannes23 wrote on Friday, November 03, 2017:

Hello Hein,

nice to see, that you could reproduce the error.
I would call it a compiler error, no matter what ever.
And I would also like the use of the union of pointers. I did similar in the past by myself.

At my actual project I have to send tcpip packets with mostly achievable speed.
So optimal routines and code optimisation in general are an issue.
Therefore I did some measurements to compare different optimisations.

  1. Use of O3, memcpy.c from demo project: ERROR
  2. Use of Os, memcpy.c from demo project 578 Mbit/s
  3. Use of O3, memcpy.c from demo, added volatile specifier 600 Mbit/s
  4. Use of Os, memcpy.c from demo, added volatile specifier 575 Mbit/s
  5. Use of O3, memcpy, memset… from GCC library 713 Mbit/s
  6. Use of Os, memcpy, memset… from GCC library 699 Mbit/s

So what I take as lesson learned in this case is

  1. Use of builtin library functions makes sense. Big impact!
  2. Use of optimisation O3 gives performance increase, although not that much.

Regarding these speed optimisations I’ve started already a thread some days ago.
So I’d like to ask again what settting or else could be done to get best transmit performance.
But probably I have already done everything possible, because 713 Mbit/s is not too bad.


davidbrown wrote on Friday, November 03, 2017:

Using a union of pointers is not safe. When you have this:

union xPointer {
uint8_t *u8;
uint16_t *u16;
uint32_t *u32;
uint32_t uint32;

The compiler is allowed to assume that u16 and u32 fields never point to the same data, that any changes made through one of the pointers cannot affect changes visible through the other, and that this also applies to mixes when you have more than one xPointer union.

Code that uses a union of pointers like this works by luck, not by design - and changes to optimisation levels and details about the alignments of the pointers can change that luck.

Pointers to char types - like uint8_t - may alias data accessed through other pointers. But even then, there are complications.

I have not studied the memcpy code closely enough to see if it is correct or to see what is actually going wrong (I also don’t have the same type of ARM for testing it).

My general advice is do not use a union of pointers - and be very careful of any sort of mixes of pointer types. Let the compiler do it - it will, baring compiler bugs, get the aliasing issues right.

Note also that newer gcc (like gcc 6) does more aggressive alias analysis than older versions - code that worked in practice may now fail. This is not a compiler bug - it is a matter of incorrect code that happened to work before.

If the code in question can use gcc extensions, then you would be wiser to use types defined like this:

typedef uint8_t __attribute__((may_alias)) uint8_a;
typedef uint16_t __attribute__((may_alias)) uint16_a;
typedef uint32_t __attribute__((may_alias)) uint32_a;

union xPointer {
  uint8_a *u8;
  uint16_a *u16;
  uint32_a *u32;
  uint32_t uint32;

In this case, you are specifically telling gcc that the pointers may alias.

It is also possible to use the “-fno-strict-aliasing” compiler switch, but normally you want type-based alias analysis optimisations for the rest of the code. It can be worth testing, however.

Since I don’t have the opportunity to test the code (no Cortex-A chips), and haven’t spotted anything obvious in the generated assembly, I cannot be sure - but aliasing issues are the most likely cause for this kind of problem. Compiler bugs cannot, of course, be ruled out. All we know for sure is that something somewhere is wrong, and should be fixed - code should not break when compiled at -O3.

It makes sense to try to identify exactly what is wrong here, so that either the memcpy.c code can be fixed, or the compiler bug reported and fixed. That applies even if the right answer to your question is simply to use gcc’s builtin memcpy.

I would greatly appreciate hearing results of testing the demo project memcpy with -O3 and -Os, first using the “-fno-strict-aliasing” flag, and then using the “may_alias” types given above.

heinbali01 wrote on Sunday, November 05, 2017:

Thank you both, Hannes and David.
One general remark: FreeRTOS is not a GCC project. The kernel is compatible with almost any compiler that is being used for embedded.
The plus libraries are being compiled and used by all major compilers and work benches.

The build-in memcpy will be faster if it uses 64-bit FPU registers, but do not forget the price: for every task switch 2 x 260 bytes must be pushed and popped from stack.

As for GCC, where the current problem occurs, I did read about -fno-strict-aliasing, but I did not encounter the may_alias attribute yet. I will try that.

I will report about my findings later today.

heinbali01 wrote on Sunday, November 05, 2017:

I’m afraid that -O3 is just buggy.

I tried the simple byte-by-byte versions of memcpy() and memset().
This memcpy() works fine, but my memset() makes the application crash:

    void *memset( void *pvDest, int iValue, size_t ulBytes )
    uint8_t *pcDest = ( uint8_t * ) pvDest;
    uint8_t ch = ( uint8_t ) iValue;
        while( ulBytes )
            *( pcDest++ ) = ch;
        return pvDest;

Now look at the assembler code with -O3 :

00113f48 <memset>:
  113f48:	e3520000 	cmp	r2, #0
  113f4c:	e92d4010 	push	{r4, lr}
  113f50:	e1a04000 	mov	r4, r0
  113f54:	0a000001 	beq	113f60 <memset+0x18>
  113f58:	e6ef1071 	uxtb	r1, r1      // Zero extend Byte. Extends an 8-bit value to a 32-bit value
  113f5c:	ebfffff9 	bl	113f48 <memset> // Branch with Link.
  113f60:	e1a00004 	mov	r0, r4
  113f64:	e8bd8010 	pop	{r4, pc}

It pushes two 32-bit register within every loop.

I tested the above code on a Cortex A9 ( Altera Cyclone V SoC ), using GCC 6.2.0 :

    arm-altera-eabi-gcc (Sourcery CodeBench Lite 2016.11-59) 6.2.0
    Copyright (C) 2016 Free Software Foundation, Inc.

Maybe this is a good example to report as a bug?


heinbali01 wrote on Sunday, November 05, 2017:

I wrote:

I’m afraid that -O3 is just buggy.

I should have written this more precisely: it looks like the GCC optimisation level 3 for the Cortex A9 is buggy.

Hannes’ original complaint was:

The problem is visible when stepping through the assembler code and
leads to a crash by pushing registers to the stack endlessly

Sounds the same as I observed just now with a totally different function.

I will stop investigating this until a GCC bug-report has been submitted.

At my actual project I have to send tcpip packets with mostly achievable speed.
So optimal routines and code optimisation in general are an issue.
Therefore I did some measurements to compare different optimisations.

The influence of memcpy() will be small compared to the performance gains that you can obtain by tuning ( configuring ) FreeRTOS+TCP:

● Use the zero-copy method in both direction ( ipconfigZERO_COPY_TX_DRIVER / ipconfigZERO_COPY_RX_DRIVER )
● Assign big buffers to the TCP sockets ( see FREERTOS_SO_WIN_PROPERTIES )
● Use large TCP windows ( FREERTOS_SO_WIN_PROPERTIES )

Here is the latest driver for Xilinx:

hannes23 wrote on Monday, November 06, 2017:

Hello Hein, hello David,

thank you both for your advice and helpful comments.
And Hein, thank you for your support regarding TCP speed optimisation.

I appreciate this very much.

Best regards

hannes23 wrote on Monday, November 06, 2017:

Hello Hein,

although this is of course not the right thread to ask this question, I’d like to comment on the
file latest, which you attached to your last post.
Only 2 files are different to my actual NetworkInterface-files. And when I exchange the file NetworkInterface.c the transmit speed slows down from 720 Mbit/s to 44 Mbit/s.

Is this due to the call of the function ulReadMDIO( PHY_REG_01_BMSR ) ?

For the time being I stay with the previous file from: June 05, 2017, 4:02:15 AM

Best regards

heinbali01 wrote on Tuesday, November 07, 2017:


We found out a few things:

● the functions memcpy() and memset(), as implemented in memcpy.c, may both crash when compiled with GCC and optimisation level -O3.

David Brown saw that the compiler will replace a loop with a call to either memcpy() or memset(). That causes the recursive call to itself.

David filed a GCC bug report here.

● What I saw is that a simple loop with an assignment:

    while( pxSource.u8 < pxLastSource.u8 )
        *( pxDestination.u8++ ) = *( pxSource.u8++ );

will generate 181 instructions when using -O3. The resulting code looks like an efficient implementation of memcpy(). In the middle of that code, FPU registers are used to make copying faster. That is what they call “aggressive optimisations”.
When a counter is being tested in the same loop, the compiler follows the logic and it will make simple code, for example like this:

    while( ( pxDestination.u8 < pxLast.u8 ) && ( ulBytes != 0ul ) )
        pxDestination.u8[ 0 ] = ( uint8_t ) iValue;

Here below I attach a new version of memcpy() that also works well with -O3: it is called memcpy_gcc_O3.c.

The problem had nothing to do with the union of pointers.

Still, I prefer to use optimsation level -Os.

● I found a way to avoid the use of FPU registers by the compiler and by CLIB, by simply omitting the -mfpu (-mfpu=neon) option.
When I use arm-xilinx-eabi-gcc, the -mfpu=neon will be added automatically.
However, when I use e.g. arm-none-eabi-gcc, all hardware options must be provided manually, and I can leave-out the -mfpu option.
Except maybe for portASM.S, for portSAVE_CONTEXT() and portRESTORE_CONTEXT().

I must admit that the performance of the GCC/Xilinx build-in memcpy() is really fast when it uses the FPU registers. You might consider using it and define in FreeRTOSConfig.h:

    /* Every task will have a floating point context. */
    #define configUSE_TASK_FPU_SUPPORT      2

On Cyclone VSoC, I saw that memcpy.c is faster than the build-in memcpy() for some unaligned cases ( all without FPU ).

● About you are right, thanks! I hadn’t realised that polling the PHY status is very disturbing for the packet transport.
There is nothing wrong with the implementation of ulReadMDIO(), calling it has a fixed duration of a few uS. The problem is that transport is slowed-down ( for me by 90% ).
In this thread you can read why I wanted to poll the PHY more often.


heinbali01 wrote on Tuesday, November 07, 2017:

As I was doing performance tests, the module here above provides two functions x_memcpy() and x_memset().
These should be renamed to memcpy() and memset()
Corrected in the current attachment memcpy_gcc_O3_v2.c.

thomask wrote on Tuesday, November 07, 2017:

I’m not sure if this qualifies as a GCC bug.

Similarily, the compiler may (and will!) replace printf(“Hello\n”) by puts(“Hello”). You can use -ffreestanding to disable these optimizations:

FPU instructions in memcpy is a newlib thing:

Unfortunately, there seems to be no single explicit flag to disable the code. You can either patch newlib, or misuse “-mno-unaligned-access” to work around. CodeTime seems to do the same thing on their toolchain:

(For this and other reasons, I have build my own toolchain:

davidbrown wrote on Tuesday, November 07, 2017:


In order to avoid using the FPU registers, you then need to have a different implementation of the memcpy function linked in, so that you never use the newlib one. I’ve been in contact with Hein off-list about his implementation - I expect he’ll post again once his is happy with a suitable version for FreeRTOS.

But it also means there is no problem using the gcc builtin memcpy. This will generate optimal code in cases where the sizes and alignments are known at compile time, or call “memcpy” otherwise. So it is still safe to use it, as long as you have your own non-FPU version of the memcpy function.