FreeRTOS+UDP end-of-buffer padding to support store-and-forward hardware checksumming

zbrozek wrote on Sunday, April 23, 2017:

I’m writing a from-scratch (not using stm32f7xx_hal_etc.{c,h}) STM32F7 zero-copy Ethernet driver for FreeRTOS+UDP 9.0.0. It mostly works (and the bugs in my driver aren’t what I’m hoping to address in this post), but I have some concern about a hack I seem to have to do.

In particular, the reference manual in the “Transmit checksum offload” section states:

You must make sure the Transmit FIFO is deep enough to store a complete frame before that frame is transferred to the MAC Core transmitter.

and it also says:

Note that: for ICMP-over-IPv4 packets, the checksum field in the ICMP packet must always be 0x0000 in both modes, because pseudo-headers are not defined for such packets. If it does not equal 0x0000, an incorrect checksum may be inserted into the packet.

But it doesn’t appear like FreeRTOS+UDP provides four unused bytes at the end of an Ethernet buffer. In order to get the checksum to be automatically inserted, I must tell the DMA peripheral that the frame is four bytes bigger than what FreeRTOS+UDP suggests, and I also have to set the last four bytes to zeros. This of course is extremely dangerous and I shouldn’t do it.

Have I missed something about the buffer allocation size that makes this OK? Or is there a way to specify end-of-buffer padding somewhere? I’ve noticed that this page is missing some details.

heinbali01 wrote on Sunday, April 23, 2017:

Hi Sasha,

I’m writing a from-scratch STM32F7 zero-copy Ethernet driver for FreeRTOS+UDP 9.0.0.

Congratulations !

Do you really mean +UDP, or +TCP?
FreeRTOS+TCP started in the /Labs section. It is the successor of +UDP. It is more complete, and more often used. If you want to use it UDP-only, that is still possible:

    #define ipconfigUSE_TCP    0

Moving from +UDP to +TCP shouldn’t be hard. The principles are all the same.

The drivers for both releases are 99% the same. FreeRTOS+TCP has an extra parameter when calling xNetworkInterfaceOutput():

    BaseType_t xNetworkInterfaceOutput( NetworkBufferDescriptor_t * const pxDescriptor, BaseType_t bReleaseAfterSend );

When bReleaseAfterSend is true, the driver ( NetworkInterface ) is responsible for releasing pxDescriptor. When you have zero-copy transmissions:

    #define ipconfigZERO_COPY_TX_DRIVER    1

… then bReleaseAfterSend will always be true, so it is safe to pass the memory to a DMA descriptor. Once the transfer is ready, the original Network Buffer must be released.

I like to work with a counting semaphore that keeps track of the number of available DMA descriptors for transmission (TX).

In the current STM32F4 driver, it looks like this:

    /* As soon as the EMAC has sent a frame, the original Network Buffer must be released. */
    ucPayLoad = ( uint8_t * )DMATxDescToClear->Buffer1Addr;

    /* Look-up the Network buffer that owns a certain payload. */
    pxNetworkBuffer = pxPacketBuffer_to_NetworkBuffer( ucPayLoad );

    /* Release the Network buffer. */
    vReleaseNetworkBufferAndDescriptor( pxNetworkBuffer ) ;

    /* Mark that it has been released already, just to be sure. */
    DMATxDescToClear->Buffer1Addr = ( uint32_t )0u;

    /* Tell the counting semaphore that one more TX descriptor is available. */
    xSemaphoreGive( xTXDescriptorSemaphore );

for ICMP-over-IPv4 packets, the checksum field in the ICMP packet must always be 0x0000 in both modes

I also encountered this ‘feature’ on STM32F4’s EMAC :

    if( pxPacket->xICMPPacket.xIPHeader.ucProtocol == ipPROTOCOL_ICMP )
    {
        pxPacket->xICMPPacket.xICMPHeader.usChecksum = ( uint16_t )0u;
    }

Just to summarise, for IPv4, the following check-sums exist:

  1. IP-header has a checksum over a so-called pseudo header (16-bits)
  2. Protocol-checksum (16-bits)
  3. A 4-byte Frame Check Sequence (FCS) that we do not get to see in the library.

Checksums 1) and 2) can either be set and checked by the EMAC, or by the library.
The last checksum can only be added by the EMAC.

Now when a packet has 100 bytes, we send 100 bytes to the EMAC, and the EMAC will secretly add 4-byte frame-check.

But there is a minimum frame length of 60 + 4 bytes. That is needed for a proper collision detection on the LAN.

In FreeRTOS+TCP, you can define the following:

    #define ipconfigETHERNET_MINIMUM_PACKET_BYTES        ( 60 )

ARP and ICMP messages can be as short as 42 bytes. The 18 extra bytes will be set to zero. So 60 bytes will be sent to the EMAC.

But it doesn’t appear like FreeRTOS+UDP provides four unused bytes at the end of an Ethernet buffer.

True because the EMAC adds 4 bytes on its own.

In order to get the checksum to be automatically inserted,
I must tell the DMA peripheral that the frame is four bytes
bigger than what FreeRTOS+UDP suggests

That is new for me, I haven’t encountered that yet. But I haven’t used an STM32F7 either.

Have I missed something about the buffer allocation size that makes this OK?

FreeRTOS+TCP will guarantee that all network buffers have a minimum size of ipconfigETHERNET_MINIMUM_PACKET_BYTES. That is particularly important when using BufferAllocation_2.c, which allocates just enough bytes for pucEthernetBuffer.

/* Adding pad bytes. */
#if defined( ipconfigETHERNET_MINIMUM_PACKET_BYTES )
{
    if( pxNetworkBuffer->xDataLength < ( size_t ) ipconfigETHERNET_MINIMUM_PACKET_BYTES )
    {
    BaseType_t xIndex;

        for( xIndex = ( BaseType_t ) pxNetworkBuffer->xDataLength;
             xIndex < ( BaseType_t ) ipconfigETHERNET_MINIMUM_PACKET_BYTES;
             xIndex++ )
        {
            /* FreeRTOS+TCP guarantees that pucEthernetBuffer is long enough. */
            pxNetworkBuffer->pucEthernetBuffer[ xIndex ] = 0u;
        }
        pxNetworkBuffer->xDataLength = ( size_t ) ipconfigETHERNET_MINIMUM_PACKET_BYTES;
    }
}
#endif

Or is there a way to specify end-of-buffer padding somewhere? I’ve noticed that this page is missing some details.

An EMAC will understand the trailing zero’s at the end. They will be transmitted, followed by an FCS.

Here you find the latest NetworkInterface.c for the STM32F4.

This and other drivers will (hopefully) soon appear in the official FreeRTOS+TCP release. Regards.

zbrozek wrote on Monday, April 24, 2017:

I had written a much longer response, but sourceforge ate it. In short:

Yes, I’m using +UDP because it seemed like the intended IP stack (mainlined, as opposed in a Lab). As I’ve been doing research it’s been getting more and more clear that I really should be using +TCP. Hopefully it plays nice with the 9.0.0 release. Right now I’m having some difficulty getting it to compile since IAR 7.80.1 seems to be ignoring the semicolon in pack_struct_end.h. I’ll fuss with it and see where I get.

Thank you so very much for your long and detailed response. The pointer to your F4 driver makes it clear that I was never getting hardware checksumming at all, and in fact my size error was due to the checksumming not adding the bytes at all. One of the bugs in the driver was that the FCS was always wrong. Now I understand that it was garbage data.

heinbali01 wrote on Monday, April 24, 2017:

Hi Sasha,

it [+UDP] seemed like the intended IP stack (mainlined, as opposed in a Lab).

FreeRTOS+TCP should have moved to mainline already. +TCP has been out and stable for a couple of years. The last stable release was 160919. A new version with some new features is due to be released soon.

As I’ve been doing research it’s been getting more and more clear that I really should be using +TCP.

I also think so.

Hopefully it plays nice with the 9.0.0 release.

It works well with any kernel from 8.1.1 up to the latest. I use 9.0.0 for all platforms that I have. It needs event_groups.c, it does not make use of timers.c.

Right now I’m having some difficulty getting it to compile since
IAR 7.80.1 seems to be ignoring the semicolon in pack_struct_end.h.
I’ll fuss with it and see where I get.

That is a known issue. Have you found this and this post already?

+UDP didn’t have portable\Compiler\IAR directory yet, how did you compile it?

Would it help if we add a macro ipconfigPACK_END, which will be placed as follows:

#include "pack_struct_start.h"

/* contains '__packed' */

struct xETH_HEADER
{
    MACAddress_t xDestinationAddress; /*  0 + 6 = 6  */
    MACAddress_t xSourceAddress;      /*  6 + 6 = 12 */
    uint16_t usFrameType;             /* 12 + 2 = 14 */
}
#include "pack_struct_end.h"

/* for IAR, an empty header file. */

ipconfigPACK_END

/* for IAR, defined as a semicolon. */

Could you try that? Regards.

heinbali01 wrote on Monday, April 24, 2017:

There are 21 occurrence of

#include "pack_struct_end.h"

which should all be replaced by :

#include "pack_struct_end.h"
ipconfigPACK_END

And add a few lines in FreeRTOSIPConfigDefaults.h:

#ifndef ipconfigPACK_END
    #define ipconfigPACK_END
#endif

zbrozek wrote on Monday, April 24, 2017:

I tried doing what +UDP does and added __attribute__( (packed) ); into the pack_struct_end.h file. It works. As far as I can tell, IAR ignored the GCC-style attribute, but the additional text was enough to get around the preprocessor bug.

I also tried your workaround. No dice, including when trying to actually define ipconfigPACK_END as a semicolon. Preprocessor output looked the same as before - no semicolons at the end of struct definitions.

FreeRTOS_IP.h needed to be #include in FreeRTOS_TCP_IP.h and FreeRTOS_Stream_Buffer.h.

The compiler emits a warning about FreeRTOS_Sockets.c having a implicit call to FreeRTOS_min_BaseType unless ipconfigHAS_INLINE_FUNCTIONS set to 1. I set it to 1 and all seems well, but it seems like that shouldn’t happen when using options.

heinbali01 wrote on Tuesday, April 25, 2017:

Glad to hear that it is OK to write it as __attribute__( (packed) );, and thanks for reporting back.

I must admit that I can not test the code with IAR at this moment ( haven’t installed it yet on my new laptop ).

The trick with ipconfigPACK_END did work properly with GCC, but that doesn’t help the IAR projects.

Thanks for pointing-out that defining ipconfigHAS_INLINE_FUNCTIONS = 0 leads to compiler warnings. In fact some defines were missing. Here they are:

	#define FreeRTOS_max_BaseType(a, b)  ( ( ( BaseType_t  ) ( a ) ) >= ( ( BaseType_t  ) ( b ) ) ? ( ( BaseType_t  ) ( a ) ) : ( ( BaseType_t  ) ( b ) ) )
	#define FreeRTOS_max_UBaseType(a, b) ( ( ( UBaseType_t ) ( a ) ) >= ( ( UBaseType_t ) ( b ) ) ? ( ( UBaseType_t ) ( a ) ) : ( ( UBaseType_t ) ( b ) ) )
	#define FreeRTOS_min_BaseType(a, b)  ( ( ( BaseType_t  ) ( a ) ) <= ( ( BaseType_t  ) ( b ) ) ? ( ( BaseType_t  ) ( a ) ) : ( ( BaseType_t  ) ( b ) ) )
	#define FreeRTOS_min_UBaseType(a, b) ( ( ( UBaseType_t ) ( a ) ) <= ( ( UBaseType_t ) ( b ) ) ? ( ( UBaseType_t ) ( a ) ) : ( ( UBaseType_t ) ( b ) ) )

( the casts are necessary just in case these macro’s are being called with constant values ).

Something about “static inline”:

The FreeRTOS kernel has never used static __inline functions. The reason is that some compilers wouldn’t understand it.

When developing FreeRTOS+TCP, the need for static inline functions increased, The have several advantages compared to macro’s:

● they allow proper type-checking of the parameters
● the compiler will decide whether the functions will either be inlined or linked, depending on the generated code-size
● they avoid a possible multiple evaluation of it’s parameters
● The use of function-like macro’s can lead to hard-to-detect errors

An example of the last :

/* What if 'ptr' is a number, in stead of a pointer? */
#define  STORE_BYTE( ptr, value )    ( ( ( uint8_t *) ptr ) = ( value ) )

Note that an older compiler for TI parts did generate erroneous ( crashing ) code in case static inline functions were used in FreeRTOS+FAT, by defining ffconfigINLINE_MEMORY_ACCESS = 1.
TI solved that in a newer release of the compiler.

Regards.

zbrozek wrote on Tuesday, April 25, 2017:

Added those in and it’s clean both with and without the definition - thanks!

I’m rewriting my driver for BufferAllocation_2.c and the deferred processing task it requires. I’ve got a couple of questions:

  • Why is it recommended not to release buffers to the stack on confirmation of transmission, but rather on the attempt to transmit again? It seems like that’ll leave a lot of orphaned buffers, up to the high water mark during a transmit burst.
  • How does FreeRTOS work on processors with a data cache and DMA? With BufferAllocation_1.c I can configure the MPU to force write-through in the appropriate directions, or to manually invalidate the cache as necessary. There doesn’t seem like quite as elegant a way to do that with BufferAllocation_2.c. Until I’ve figured out how I want to deal with it, I’m just not enabling the data cache.

Thanks again!
-Sasha

heinbali01 wrote on Tuesday, April 25, 2017:

Hi Sasha,

Added those in and it’s clean both with and without the definition - thanks!

Now added to mainline, thanks to you.

( some users encounter imperfections, correct them, but do not bother about reporting them on a forum )

I’m rewriting my driver for BufferAllocation_2.c and the deferred processing task it requires.

Rewriting? Does it need a different approach?
I think that for a driver, the only difference is that you do not need vNetworkInterfaceAllocateRAMToBuffers() any more. Or do you have more changes?

There has been a driver using BufferAllocation_2.c, that would pass the pucEthernetBuffer field of a Network Buffer to DMA, assign NULL to it, and release the Network Buffer. That is OK, as long as BufferAllocation_2.c is used, and therefore I would not recommend that method.
I prefer to hand over the Network Buffer to DMA (only for a few µS or 1 mS), and when transmission is ready, call pxPacketBuffer_to_NetworkBuffer() to get the pointer to the owning Network buffer. That method works for both BufferAllocation_1 and BufferAllocation_2.

BufferAllocation_2.c and the deferred processing task it requires

Why wouldn’t BufferAllocation_1.c need a deferred processing task? I think it’s all the same? Except for the trick described here above?

Why is it recommended not to release buffers to the stack on confirmation,
of transmission but rather on the attempt to transmit again?
It seems like that’ll leave a lot of orphaned buffers,
up to the high water mark during a transmit burst.

( it is not recommended, it was caused by laziness… or lack of time to write it better )

I think that you found this in an older driver ( indeed maybe from the latest release ). What you see there was not intentional.

In e.g. this recent driver:

https://sourceforge.net/p/freertos/discussion/382005/thread/39bcfcfb/3ee0/attachment/STM32Fxx_networkInterface_and_iperf.zip

( I did add a link here above ) you will see that the Network Buffer is being released immediately after the EMAC confirms the (success or failure of a) transmission.

How does FreeRTOS work on processors with a data cache and DMA?

Good question !

With BufferAllocation_1.c I can configure the MPU to force write-through in the appropriate directions,
or to manually invalidate the cache as necessary. There doesn’t seem like quite as elegant a way to do
that with BufferAllocation_2.c. Until I’ve figured out how I want to deal with it, I’m just not enabling
the data cache.

( the following text is based on my experience with Xilinx Zynq, which has a zero-copy driver, and a lot of cached DRAM )

With BufferAllocation_2.c, memory will be obtained by calling pvPortMalloc(), which is normally cached memory. When using BufferAllocation_1.c, indeed you can choose where to allocate the memory, cached or uncached.

When using cached memory, before passing it to DMA, it must be flushed. And after receiving data, the memory area must be refreshed before it is up-to-date.

The mistake that I made ( I’ve had headaches about this ), was to flush or refresh parts of memory which had an overlap with other objects!

Lesson: you have to make sure that the starting offset and the length of the memory blocks occupied by pucEthernet are properly aligned with boundaries of the caching.

Using a length of 1536 ( 0x600 ) bytes worked wonderfully.

Later I decided to dedicate 1 MB of uncached DRAM to DMA, both for the MMC driver as well as ETH. Since then, the driver doesn’t worry about flushing and refreshing any more. Regards.

glenenglish wrote on Tuesday, April 25, 2017:

For others reading this
required reading "Cortex M7 programming manual ST doc PM0253
Sections 2.3 , 4.8, 4.9

http://www.st.com/content/ccc/resource/technical/document/programming_manual/group0/78/47/33/dd/30/37/4c/66/DM00237416/files/DM00237416.pdf/jcr:content/translations/en.DM00237416.pdf

It’s updated quite regularly.

zbrozek wrote on Thursday, April 27, 2017:

Rewriting? Does it need a different approach?
I think that for a driver, the only difference is that you do not need vNetworkInterfaceAllocateRAMToBuffers() any more. Or do you have more changes?

Well, I had originally used BufferAllocation_1.c because it would allow me to do all of the buffer swapping in the ISR, thus not needing the deferred processing task at all. So beyond removing that call, I also have to reduce the scope of the ISR and move that functionality into a task. No big deal, just something that needs doing. I’m most-active on weekends (this is a hobby project), so I’m a bit behind on implementing everything.

Right now I’m still wrapping my head around the padding prior to the Ethernet buffers and when I want to make calls to pucGetNetworkBuffer() versus pxGetNetworkBufferWithDescriptor(). This section suggests the former, but at the bottom of this section the example uses the latter. I assume that’s because when I populated the RX DMA descriptors, that I used pucGetNetworkBuffer() and the IP stack needs a populated descriptor to be able to parse the packet.

I’m especially confused by this:

*( ( NetworkBufferDescriptor_t ** )
       ( pxDMARxDescriptor->pucEthernetBuffer - ipBUFFER_PADDING ) ) = pxDMARxDescriptor;

But I haven’t spent much time on it; it’ll probably become clear when I get to studying it more closely.

I started looking in to how to forcibly clean and forcibly invalidate ranges of addresses from the D-cache. Looks like core_cm7.h defines SCB_InvalidateDCache_by_Addr and SCB_CleanDCache_by_Addr which should do just what I need. Yay!

zbrozek wrote on Saturday, April 29, 2017:

OK; I’m now mostly operational with +TCP. My demo board (STM32F746 Discovery) is requesting, receiving, and ACKing a DHCP lease. I implemented your ICMP checksum workaround in my driver. It doesn’t keel over and die when I ping flood the board. Success?!