STM32H755 FreeRTOS+TCP unaligned access fault in FreeRTOS_MatchingEndpoint() memcpy

I’ve run into a strange problem with FreeRTOS+TCP on a Nucleo STM32H755 board. When I run in the debugger, the network stack crashes at the first received packet. Digging further, it’s the memcpy call near the end of FreeRTOS_MatchingEndpoint(), which copies the six-byte destination MAC address into another variable before calling pxEasyFit(). I haven’t added any fancy fault-handling code, but the basic fault registers are saying it’s an unaligned access fault.

Inspecting the data structures prior to the memcpy call, everything looks good to me, although the destination address is indeed unaligned: It is trying to copy six bytes from 0x2402bfa8 (divisible by 8) to 0x300000ca (divisible by 2).

One thing I don’t understand (of many :) ) is that the same code runs successfully without crashing when the debugger is not attached. I am running a modified TCP echo server test, and I can ping the board successfully. The echo is not yet working, for reasons which may or may not be related to the above…

Can anyone confirm that the memcpy in question is supposed to be unaligned, and is supposed to work anyway? Or, confirm that it should be aligned and that’s where I should be looking.

I have my network buffers in SRAM1 (128k region starting at 0x30000000) and I believe the MPU is correctly configured. Although for some reason MPU_RBAR reads as 0 in the debugger… it’s unclear to me from the ARMv7-M architecture reference if that is expected behavior or not. (The other MPU registers are reading back as expected.)

FreeRTOS is otherwise working fine, e.g. I have three low-priority tasks blinking separate LEDs. No other interrupt activity on the system yet, and no ST HAL stuff apart from the Ethernet driver in portable/NetworkInterface/STM32/Drivers/H7. I am using BufferAllocation_1.c.

In the statement you mention, memcpy() copies to this struct:

    struct xMAC_ADDRESS
    {
        uint8_t ucBytes[ ipMAC_ADDRESS_LENGTH_BYTES ]; /**< Byte array of the MAC address */
    };
    typedef struct xMAC_ADDRESS MACAddress_t;

    MACAddress_t xMACAddress;

Although the struct contains only 6 bytes, we can assume that the compiler places it at a 4-byte aligned address and reserves 8 bytes for it.

    pxPacket->xUDPPacket.xEthernetHeader.xDestinationAddress.ucBytes[6]

And memcpy() will copy from that ucBytes[] array. Its address is certainly 4-byte aligned + 2 bytes.

    NetworkEndPoint_t * FreeRTOS_MatchingEndpoint( ... )
    {
        ...
        memcpy( xMACAddress.ucBytes,
                pxPacket->xUDPPacket.xEthernetHeader.xDestinationAddress.ucBytes,
                ipMAC_ADDRESS_LENGTH_BYTES );
        ...
    }

The compiler thinks that source and destination are both 32-bit aligned, whereas we have put ucBytes[0] at an offset of +2 bytes (see ipconfigPACKET_FILLER_SIZE).
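
To make the “+2 bytes” concrete, here is a small sketch of the address arithmetic. The struct and macro names are illustrative stand-ins, not the real FreeRTOS+TCP definitions, and it assumes a filler size of 2 and a network buffer that starts at 0x300000c8, which would put the destination MAC at the 0x300000ca reported above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins, NOT the real FreeRTOS+TCP definitions. */
#define DEMO_PACKET_FILLER_SIZE    2u  /* assumed value of ipconfigPACKET_FILLER_SIZE */
#define DEMO_MAC_ADDRESS_LENGTH    6u

typedef struct
{
    uint8_t ucDestination[ DEMO_MAC_ADDRESS_LENGTH ];
    uint8_t ucSource[ DEMO_MAC_ADDRESS_LENGTH ];
    uint8_t ucFrameType[ 2 ];
} DemoEthernetHeader_t; /* 14 bytes of uint8_t, so no padding */

/* Non-zero when ulAddress is a multiple of ulAlignment. */
int xIsAligned( uint32_t ulAddress, uint32_t ulAlignment )
{
    return ( ulAddress % ulAlignment ) == 0u;
}

/* Address of the destination MAC for a buffer starting at ulBufferStart. */
uint32_t ulDestinationMACAddress( uint32_t ulBufferStart )
{
    return ulBufferStart + DEMO_PACKET_FILLER_SIZE
           + ( uint32_t ) offsetof( DemoEthernetHeader_t, ucDestination );
}

/* Address of the IP header that follows the 14-byte Ethernet header. */
uint32_t ulIPHeaderAddress( uint32_t ulBufferStart )
{
    return ulBufferStart + DEMO_PACKET_FILLER_SIZE
           + ( uint32_t ) sizeof( DemoEthernetHeader_t );
}
```

With these assumptions the filler keeps the IP header word-aligned (2 + 14 = 16 bytes into the buffer), but the price is that the Ethernet header, and so the MAC address, is only halfword-aligned.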

The function memcpy() crashes. I have seen more alignment problems when using optimisation for size (-Os).

The function memset() also produces alignment errors.

I never noticed this problem, because I use my own memcpy().
It is written in C and it is optimised. It also implements memset(). It doesn’t use FPU instructions or registers.
If you want to use this memcpy.c, just download it, include it in your project and it should be used.
I also recommend using these flags:

	OPTIMIZATION = -Os -fno-builtin-memcpy -fno-builtin-memset

This prevents the compiler from secretly replacing calls to memcpy()/memset() with an inline version, which could make the same mistake.

If you insist on using the built-in versions, you can try to mislead the compiler by putting wrappers in a separate module:

    void * my_memcpy( void * destination, const void * source, size_t count )
    {
        return memcpy( destination, source, count );
    }

    void * my_memset( void * destination, int value, size_t count )
    {
        return memset( destination, value, count );
    }

but you might still need no-builtin-mem...

Thanks Hein!

I will try this, but I can’t help wondering if the problem is fully understood. I still can’t explain why this would run OK without the debugger attached, but cause the unaligned access fault with the debugger attached. That bugs me…

I’m also wondering if your fix is really what we want for the project going forward. If FreeRTOS+TCP is supposed to work with GCC/newlib (as well as IAR and Keil), then it seems like there should be a way to achieve that without hacking substitute memcpy() and memset(). I’m assuming the newlib folks would not consider this a bug, so is there a way to work around it while still using the standard library? What do the other ports do?

(In the “worst case” that newlib’s mem___ are unusable, wouldn’t it be best to commit your customized functions to the repository, give them different names like safeMemCpy, etc., and call those from the code instead of trying to override the standard memcpy/memset?)
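
For illustration, a minimal sketch of what such a dedicated function could look like, assuming a plain byte-by-byte copy is acceptable performance-wise. The name safeMemCpy is just the placeholder from the paragraph above, not something that exists in the repository. The volatile qualifiers are there so that every access is a single byte, and they should also keep the compiler from turning the loop back into a memcpy() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies uxCount bytes one at a time, so it never
 * issues a word or halfword access regardless of pointer alignment. */
void * safeMemCpy( void * pvDestination, const void * pvSource, size_t uxCount )
{
    volatile uint8_t * pucDestination = ( volatile uint8_t * ) pvDestination;
    const volatile uint8_t * pucSource = ( const volatile uint8_t * ) pvSource;

    while( uxCount > 0u )
    {
        *pucDestination = *pucSource; /* single byte load and store */
        pucDestination++;
        pucSource++;
        uxCount--;
    }

    return pvDestination;
}
```

A call site would then read safeMemCpy( xMACAddress.ucBytes, ..., ipMAC_ADDRESS_LENGTH_BYTES ) instead of memcpy(). It is slower than a word-wise copy, which matters little for a 6-byte MAC address but would for payload copies.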

Several years ago I resolved and documented (fifth issue at Using exceptions in C++ embedded software | David Crocker's Solutions blog) a related issue on a SAME70 processor, which is also a Cortex M7. The implementation of newlib supplied with gcc for the Arm Cortex M7 is compiled with unaligned memory accesses permitted, and the code generated for memcpy includes an instruction that performs an unaligned memory access in some cases. Normally, unaligned memory accesses are permitted on the Cortex M7, but I was using a non-cacheable area of memory for which unaligned accesses are forbidden.

It was suggested to me to recompile newlib with the compiler option -mno-unaligned-access added. I didn’t want to do that, so instead I copied the newlib memcpy source and added it to my project, which already had -mno-unaligned-access in the compile options. I did the same for memset and memmove in case they suffered from a similar issue.

GCC memcpy() copies from a struct, and it is not aware that the struct is (secretly) placed at a location which is “4-byte aligned + 2 bytes”.

The packed declaration does not serve as a warning to the compiler.
It is quite possible that the 6 bytes are copied with two load/store pairs like this:


    ; Copy 32 bits (4 bytes) from [r0] to [r1]
    ldr     r2, [r0]        ; Load 32-bit word from source (r0) into r2
    str     r2, [r1]        ; Store 32-bit word from r2 into destination (r1)

    ; Copy 16 bits (2 bytes) from [r0 + 4] to [r1 + 4]
    ldrh    r3, [r0, #4]    ; Load 16-bit halfword from source + 4 bytes
    strh    r3, [r1, #4]    ; Store 16-bit halfword to destination + 4 bytes

r0 is a badly aligned address, and so the first instruction is bound to crash.

David Crocker wrote:

The implementation of newlib supplied with gcc for the Arm Cortex M7 is compiled with unaligned memory accesses permitted

That is indeed also possible: the compiler ignores the packed attribute because it assumes that unaligned memory access is allowed.

And a last remark: I have seen a compiler ‘optimising’ my memcpy.c by replacing copy instructions with (recursive!) calls to memcpy():

-    pxDestination.u32[ 0 ] = pxSource.u32[ 0 ];
-    pxDestination.u32[ 1 ] = pxSource.u32[ 1 ];
-    pxDestination.u32[ 2 ] = pxSource.u32[ 2 ];
-    pxDestination.u32[ 3 ] = pxSource.u32[ 3 ];
-    pxDestination.u32[ 4 ] = pxSource.u32[ 4 ];
-    pxDestination.u32[ 5 ] = pxSource.u32[ 5 ];
-    pxDestination.u32[ 6 ] = pxSource.u32[ 6 ];
-    pxDestination.u32[ 7 ] = pxSource.u32[ 7 ];
+    memcpy( pxDestination.u32, pxSource.u32, 32 );

That is why I used -fno-builtin-memcpy.

Is it possible that your fault handler ignores the fault and continues?
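
One way to check is to capture the fault status in the handler and decode it. Below is a rough sketch of the decoding only, written so it can be tested off-target. The helper names are made up for this post. The bit positions are the ARMv7-M ones as I understand them: UNALIGNED is bit 8 of the UsageFault status, which sits in the upper half of CFSR, and UNALIGN_TRP is bit 3 of CCR.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M bit positions; the helper names below are hypothetical. */
#define DEMO_CFSR_UNALIGNED_BIT      ( 1UL << 24 ) /* UFSR.UNALIGNED, seen through CFSR */
#define DEMO_CCR_UNALIGN_TRP_BIT     ( 1UL << 3 )  /* CCR.UNALIGN_TRP */

/* Non-zero when a captured CFSR value reports an unaligned access fault. */
int xIsUnalignedAccessFault( uint32_t ulCFSR )
{
    return ( ulCFSR & DEMO_CFSR_UNALIGNED_BIT ) != 0u;
}

/* Non-zero when a captured CCR value has trapping of all unaligned
 * word and halfword accesses enabled. */
int xIsUnalignedTrapEnabled( uint32_t ulCCR )
{
    return ( ulCCR & DEMO_CCR_UNALIGN_TRP_BIT ) != 0u;
}
```

On the target you would pass in SCB->CFSR and SCB->CCR (CMSIS names). If UNALIGN_TRP reads as 0 and the fault still occurs, that would point at the memory attributes of the region rather than at a global trap setting.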

Well this is curious. I haven’t redefined memcpy yet, but I did go into my compiler settings and change a few things. I added -mno-unaligned-access which was not there previously, and while I was there I also noticed that ST’s default config uses --specs=nosys.specs but does not use newlib-nano, which I usually prefer. So I also prepended --specs=nano.specs to make that change, then rebuilt everything and tested again. This is still a debug configuration with -O0.

For some reason, it now seems happy. I looked at the pointer addresses and of course they are not word aligned, but memcpy isn’t causing any problems. I wonder if the nano version is built differently? In any case, I’ll probably just leave this for the time being and see if there are any transient failures.

Thank you for sharing!

Last time I looked, newlib-nano memcpy just copied 1 byte at a time, so there were never any alignment issues. Whereas newlib memcpy generally copies several dwords at a time, mostly avoiding unaligned accesses, but sometimes performing an unaligned access while copying the last 3 bytes even when the structs being copied are aligned.