Lowest overhead raw ethernet frame protocol


Some time ago I had some significant and greatly appreciated assistance from @htibosch to get the +TCP stack up and running on SAME70 MCU. The stack worked exactly as desired (with the exception of DCache support, which I never got around to addressing). However, following various use case analyses and longer term testing, I’ve determined that the use of a TCP-based approach is overkill for my particular application and contains far too much overhead.

For a bit more context, I’m looking to interface an MCU with a co-processor (a Cortex-based core) on the same PCB. Both the MCU and the co-processor have a PHY, and the two PHYs are connected directly to each other via traces (i.e. no connectors, magnetics, etc.). As this is a ‘direct’ connection, I’d ideally like to use it as some sort of DMA-based buffer mechanism, whereby a raw ethernet frame (i.e. below the IP layer) is sent by one and received by the other.

My initial question is this: do I need the +TCP suite for this? I’m looking for the lowest possible overhead to allow for the highest possible throughput. The actual structure of my custom protocol (i.e. the use of metadata to be sent with the payload) will be constructed once I know how to shift the raw data out of the MCU without relying on the IP (and TCP) stack. My intention would be to obviously use a DMA approach to minimise MCU processing as much as possible.

Chances are that you won’t even need the PHYs and you can connect the MACs directly e.g. via RGMII. For raw ethernet frame transfer you don’t need a full network stack. The MAC can be used as any other peripheral and you just need a MAC/ethernet driver to send/receive ethernet frames with a fixed configuration e.g. 100MBit/full-duplex.
In your case you could re-use the existing ethernet driver and maybe adapt the buffer management and send/receive interface to match your needs.
I’m sure this was already done by others and you’ll find some more information on the net.
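
To make the "raw frame without a network stack" idea a bit more concrete, here is a minimal sketch of building such a frame in a buffer before handing it to the MAC driver. The function name, the `RAW_ETH_*` macros and the choice of EtherType are my own assumptions, not anything from the +TCP sources. 0x88B5 is one of the IEEE "local experimental" EtherTypes, which seems a reasonable fit for a private point-to-point link. The sketch assumes the MAC appends the FCS in hardware; many MACs can also pad short frames themselves, so check the datasheet before keeping the padding step.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RAW_ETH_ADDR_LEN     6u
#define RAW_ETH_HDR_LEN      14u     /* dst(6) + src(6) + EtherType(2) */
#define RAW_ETH_MIN_FRAME    60u     /* minimum frame length without the 4-byte FCS */
#define RAW_ETH_MAX_PAYLOAD  1500u
#define RAW_ETH_ETHERTYPE    0x88B5u /* IEEE Local Experimental EtherType 1 (assumed choice) */

/* Builds dst | src | EtherType | payload (+ zero padding) into buf.
 * Returns the number of bytes to pass to the MAC driver, or 0 on error. */
size_t raw_frame_build(uint8_t *buf, size_t buf_len,
                       const uint8_t dst[RAW_ETH_ADDR_LEN],
                       const uint8_t src[RAW_ETH_ADDR_LEN],
                       const void *payload, size_t payload_len)
{
    size_t used = RAW_ETH_HDR_LEN + payload_len;
    size_t frame_len = used;

    if (payload_len > RAW_ETH_MAX_PAYLOAD) {
        return 0;
    }
    if (frame_len < RAW_ETH_MIN_FRAME) {
        frame_len = RAW_ETH_MIN_FRAME;   /* short frames are padded */
    }
    if (buf_len < frame_len) {
        return 0;
    }

    memcpy(&buf[0], dst, RAW_ETH_ADDR_LEN);
    memcpy(&buf[6], src, RAW_ETH_ADDR_LEN);
    buf[12] = (uint8_t)(RAW_ETH_ETHERTYPE >> 8);    /* EtherType is big-endian */
    buf[13] = (uint8_t)(RAW_ETH_ETHERTYPE & 0xFFu);
    memcpy(&buf[RAW_ETH_HDR_LEN], payload, payload_len);
    memset(&buf[used], 0, frame_len - used);        /* zero the padding, if any */

    return frame_len;
}
```

Note that the padding hides the real payload length from the receiver, so a custom protocol on top of this would probably want its own length field at the start of the payload.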

Thanks for that Hartmut, your input is very much appreciated!

The RGMII proposition is an interesting one. Unfortunately, in this particular instance, the Cortex processor has an integrated PHY, so I don’t have the option of a direct RGMII to RGMII interface. Would the same MAC/ethernet driver approach still work if using the PHYs? Apologies in advance for the basic questions, low-level networking is not an area with which I have much experience!

Except for the link (auto-)negotiation, the PHY is transparent with regard to the ethernet communication. So no problem. You might only need a PHY driver in addition to bring up the link. Usually PHYs can be bootstrapped with external resistors to set the desired default configuration, e.g. 100MBit/FD. Then you could even omit a PHY driver to handle link up/down/speed/mode if you can expect that the link configuration is fixed and never breaks. I’d perhaps do that later on once the PCB has matured. Otherwise the PHY driver is very small and has no runtime impact. It’s also fine to keep it.
I’m curious: Which MCU has an integrated ethernet PHY?

That all makes sense, thanks for your explanation.

It’s not an MCU as such, it’s an ARM-based multi-core processor assembly. It provides the integrated PHY as a means of making it easier to integrate the assembly into a design.

I’m using a KSZ8081 PHY on the MCU side, and the ARM assembly uses a KSZ9031. Both of these support auto-negotiation by default; does this mean that they can bring up the link and agree on a speed independent of any drivers? The default/reset values for each PHY should result in a 100BASE-TX, full-duplex link after auto-negotiation, if I’m reading the datasheets correctly.

Yes. I’m using a system with a similar fixed HW configuration including a PHY with the same bootstrap configuration. This would work without any intervention.
You might need some synchronization to start your ethernet protocol handlers after the link is up, to make your life easier. In other words, it’s good to know when it’s safe to send an ethernet frame.
That’s the reason why I’m using a minimal PHY driver to get notified when the link is up and it’s safe to activate the network stack (which might start a DHCP sequence, …).
The PHY has an INT signal output connected to the MCU, which I’m using to get a GPIO-IRQ. For diagnostic purposes I verify the link state and notify the network stack.
If you don’t care you could just delay the start of the protocol handlers by about 3 sec (see the data sheet of the PHY, the max. auto-negotiation time is documented).
I normally do NOT recommend synchronization by delay but in this case it could be good enough :slight_smile:
Alternatively you can also use the generic PHY driver coming with FreeRTOS+TCP stack.
It might be useful at least while evaluating the 1st prototype PCBs :wink:
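
As a rough illustration of what such a minimal "is the link up yet?" check can look like, here is a sketch that polls the PHY's Basic Status Register (register 1 in the IEEE 802.3 clause 22 register set) over MDIO. The `mdio_read_fn` accessor is hypothetical and stands in for whatever MDIO read routine the MAC driver provides; the function names are mine as well. It assumes auto-negotiation is enabled, which matches the default configuration discussed above. With a strapped, forced-mode link the auto-negotiation bit would not be set and only the link bit should be checked.

```c
#include <stdbool.h>
#include <stdint.h>

#define PHY_REG_BMSR           1u         /* Basic Status Register (clause 22) */
#define PHY_BMSR_LINK_STATUS   (1u << 2)  /* 1 = link is up (latches low) */
#define PHY_BMSR_ANEG_COMPLETE (1u << 5)  /* 1 = auto-negotiation finished */

/* Hypothetical MDIO accessor supplied by the MAC driver. */
typedef uint16_t (*mdio_read_fn)(uint8_t phy_addr, uint8_t reg);

/* True when the status word reports link up and auto-negotiation done. */
bool phy_bmsr_link_ready(uint16_t bmsr)
{
    const uint16_t need = PHY_BMSR_LINK_STATUS | PHY_BMSR_ANEG_COMPLETE;
    return (bmsr & need) == need;
}

/* Polls the PHY once. The link status bit latches a past link-down event
 * until it is read, so the first read is discarded and the second one
 * reflects the current state. */
bool phy_link_is_up(mdio_read_fn mdio_read, uint8_t phy_addr)
{
    (void)mdio_read(phy_addr, PHY_REG_BMSR);
    return phy_bmsr_link_ready(mdio_read(phy_addr, PHY_REG_BMSR));
}
```

A task could call `phy_link_is_up()` every few hundred milliseconds until it returns true and only then enable the protocol handlers, or call it once from the handler of the PHY INT interrupt.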

Thanks again Hartmut, your input has helped enormously.

I’ve spent much of today trawling through the +TCP source, as well as looking for example MAC/raw ethernet frame drivers online. I keep thinking that this ethernet approach just seems overly complicated for what I’m ultimately trying to do:

  1. The MCU sends measurement data to the Cortex processor.
  2. The Cortex processor sends configuration data (on boot and sporadically during runtime) to the MCU.

The Cortex processor is running a highly customised embedded OS based on the Linux kernel, so while certainly doable, it looks to be an involved process making this work across different architectures (and making the data available in user space within the embedded OS).

So, what I had initially proposed, before looking at this ethernet approach, was to use SPI instead. I would set up the MCU as the SPI Master and operate a MOSI-only (i.e. unidirectional) SPI interface with the Cortex processor. The latest Linux kernels allow SPI Slave configuration, so accessing the data in user space is certainly possible. I can then use a UART interface for the exchange of configuration and confirmation data, leaving the higher speed/throughput SPI interface solely for the measurement data.

This approach appeals to me for a couple of reasons. Firstly, I’ve already built DMA-based SPI and UART drivers for this MCU, which have been tested and work nicely. Without having done the tests, I can only assume the computational overhead of this approach would have to be lower than that of the ethernet approach, even if using a very rudimentary raw ethernet frame driver. The other major reason is not having to deal with the GMAC and associated processes within the Linux kernel (and by extension in the embedded OS). There are other networks in use within the custom OS, so segregating this function entirely from other network-based functionality would be a big advantage.
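
Since a MOSI-only SPI link gives the receiver no way to ask for a retransmission, the data stream probably needs some framing so the receiver can find block boundaries and detect corrupted or dropped blocks. Below is a minimal sketch of one way to do that. The frame layout (sync word, sequence counter, length, CRC trailer), the sync value 0xA55A and all names are assumptions made for illustration only. The CRC is the common CRC-16/CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPI_FRAME_SYNC     0xA55Au  /* assumed marker for start of frame */
#define SPI_FRAME_HDR_LEN  6u       /* sync(2) + seq(2) + len(2), big-endian */
#define SPI_FRAME_CRC_LEN  2u

/* CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection, no final XOR. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Packs header | payload | CRC into out. The CRC covers header and payload.
 * Returns the total frame length, or 0 if out is too small. */
size_t spi_frame_pack(uint8_t *out, size_t out_len, uint16_t seq,
                      const uint8_t *payload, uint16_t payload_len)
{
    size_t body = SPI_FRAME_HDR_LEN + (size_t)payload_len;
    size_t total = body + SPI_FRAME_CRC_LEN;

    if (out_len < total) {
        return 0;
    }
    out[0] = (uint8_t)(SPI_FRAME_SYNC >> 8);
    out[1] = (uint8_t)(SPI_FRAME_SYNC & 0xFFu);
    out[2] = (uint8_t)(seq >> 8);
    out[3] = (uint8_t)(seq & 0xFFu);
    out[4] = (uint8_t)(payload_len >> 8);
    out[5] = (uint8_t)(payload_len & 0xFFu);
    memcpy(&out[SPI_FRAME_HDR_LEN], payload, payload_len);

    uint16_t crc = crc16_ccitt(out, body);
    out[body] = (uint8_t)(crc >> 8);
    out[body + 1] = (uint8_t)(crc & 0xFFu);
    return total;
}
```

On the receiving side, running `crc16_ccitt()` over the whole frame including the trailer yields 0 for an intact frame, and a gap in the sequence counter shows that a block was lost. On the MCU a hardware CRC unit or a table-driven CRC would be cheaper than this bitwise loop.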

Is this approach completely crazy? I also have USB available but I imagine the complexity might put it in line with the ethernet approach.

It’s not crazy at all. If the SPI interface is fast enough on both sides, it’s likely the lowest-overhead and simplest solution. You need to check which (discrete!) baud rates are supported by the respective SPI controllers and find a common denominator. Then I’d verify that the end-to-end performance matches your requirements.
But I doubt that you get near the 100MBit an ethernet link provides.
The (slow) control connection via UART shouldn’t be a problem even though the serial/TTY stack on Linux involves a considerable overhead, latencies etc.
Dealing with raw ethernet on the MCU/FreeRTOS side wouldn’t be too complicated and since the other OS is Linux you have the raw socket interface accessible from user space. Especially if the ethernet interface is exclusively connected to the MCU and not to another network this should be doable, too. But I’ve no experience with that.
While thinking about it: instead of trying to optimize networking overhead, are you sure that using simple and standard UDP is not sufficient? UDP is much cheaper than TCP and it’s standard… it’s hard to tell :wink:
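
For reference, the raw socket interface mentioned above is the `AF_PACKET` socket family on Linux. The following is only a sketch of opening such a socket from user space and binding it to one interface and one EtherType; the helper names and the EtherType 0x88B5 (an IEEE "local experimental" value) are assumptions. Opening the socket needs the `CAP_NET_RAW` capability (or root).

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define RAW_ETH_ETHERTYPE 0x88B5u  /* assumed: IEEE Local Experimental EtherType 1 */

/* Fills a link-layer address selecting one interface and one EtherType. */
void raw_eth_fill_addr(struct sockaddr_ll *addr, unsigned int ifindex,
                       unsigned short ethertype)
{
    memset(addr, 0, sizeof *addr);
    addr->sll_family = AF_PACKET;
    addr->sll_protocol = htons(ethertype);
    addr->sll_ifindex = (int)ifindex;
}

/* Opens a raw packet socket bound to ifname. Returns the fd, or -1 on error
 * (unknown interface, missing CAP_NET_RAW, ...). */
int raw_eth_open(const char *ifname, unsigned short ethertype)
{
    unsigned int ifindex = if_nametoindex(ifname);
    if (ifindex == 0) {
        return -1;
    }

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ethertype));
    if (fd < 0) {
        return -1;
    }

    struct sockaddr_ll addr;
    raw_eth_fill_addr(&addr, ifindex, ethertype);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With `SOCK_RAW`, each `recv()` on the returned descriptor delivers one frame starting at the destination MAC address, and a buffer passed to `send()` must likewise begin with the complete 14-byte ethernet header.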

All very good points.

I’ve checked the SPI controllers on both sides, and it looks like 40Mbit/s is the fastest common speed. The Cortex side actually supports 60Mbit/s but only in Master mode, so 40Mbit/s is the fastest possible with the MCU as the Master.

I’ve done some calculations today, and am fairly confident that the actual payload throughput of SPI will be considerably higher than the ethernet approach, even though the ethernet interface is ‘faster’ (100Mbit vs. 40Mbit). Even if I were to use UDP, which is definitely cheaper than TCP as you’ve said, there’s still upwards of 50 bytes of non-payload data (i.e. headers, etc.) per transaction.
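
For anyone wanting to redo this kind of estimate, here is a small sketch of the wire-level arithmetic. It only counts bytes on the wire (preamble/SFD, ethernet header, FCS, inter-frame gap, plus IPv4 and UDP headers where used) and ignores all software overhead, so it is an upper bound rather than a measurement. The function names are made up for illustration.

```c
#include <stdint.h>

#define ETH_PREAMBLE_SFD  8u   /* preamble + start-of-frame delimiter */
#define ETH_HEADER        14u  /* dst + src + EtherType */
#define ETH_FCS           4u
#define ETH_IFG           12u  /* minimum inter-frame gap, in byte times */
#define ETH_MIN_PAYLOAD   46u  /* shorter payloads are padded to this */
#define IPV4_HEADER       20u  /* without options */
#define UDP_HEADER        8u

/* Byte times one frame occupies on the wire for a given layer-2 payload. */
uint32_t eth_wire_bytes(uint32_t l2_payload)
{
    if (l2_payload < ETH_MIN_PAYLOAD) {
        l2_payload = ETH_MIN_PAYLOAD;
    }
    return ETH_PREAMBLE_SFD + ETH_HEADER + l2_payload + ETH_FCS + ETH_IFG;
}

/* Byte times for one UDP datagram carrying app_payload bytes. */
uint32_t udp_wire_bytes(uint32_t app_payload)
{
    return eth_wire_bytes(app_payload + UDP_HEADER + IPV4_HEADER);
}

/* Best-case payload throughput in kbit/s on a link of link_kbps. */
uint32_t goodput_kbps(uint32_t link_kbps, uint32_t payload, uint32_t wire_bytes)
{
    return (uint32_t)(((uint64_t)link_kbps * payload) / wire_bytes);
}
```

The result depends heavily on how much payload goes into each transaction. With a full 1472-byte UDP payload the frame occupies 1538 byte times, i.e. roughly 95.7 Mbit/s of payload on a 100 Mbit/s link. With a 16-byte payload it occupies 84 byte times, i.e. roughly 19 Mbit/s, which is below a 40 Mbit/s SPI clock.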

Obviously a lot of that can be removed using the raw ethernet frame and/or a custom ethertype, but I still wonder whether I’d be better off sticking with SPI. There’s another option here I’m also considering, which is actually using QSPI (which is available on both sides) and shifting 4 bits per clock, which has the potential to increase the throughput significantly. The use of QSPI for something other than simply reading from/writing to a flash slave is something with which I’m not familiar, so whether this is possible/practicable I’m not yet sure.

You’re right, even with a single lane you get 40 MBit throughput with the least overhead and a very thin and rather simple software stack on both sides.
I guess QSPI is just SPI with more data lines, and with this feature and a 40 MHz clock it’s certainly the best option.
Good luck :+1:

Thanks Hartmut, I’ve really appreciated your input!