TCP Ping, ARP on STM32H745 Nucleo: can receive but not transmit

Hi! This is my first FreeRTOS project. I’m working in STM32CubeIDE under Ubuntu 20.04 with a configuration for the STM32H745ZI-Q Nucleo that I generated from STM32CubeMX, using GCC as the compiler. I’ve followed the bits of advice I’ve gathered from here, including deleting the CubeMX-generated stm32h7xx_eth.c, enabling the Ethernet global interrupt, setting the HAL timer to TIM1, increasing configTOTAL_HEAP_SIZE to 120000, updating the linker script to include .ethernet_data, externing NetworkInterface.c’s references to DMARxDscrTab (and Tx) in main.c, assigning them to the .ethernet_data linker section, and verifying the placement in the .map file. Here’s my repo: GitHub - diydsp/safertos_tcptest: testing SafeRTOS/FreeRTOS with FreeRTOS-plus-TCP
I can compile successfully and step through the initial creation of tasks, but I’m stuck on the networking when I attempt to ping my board. I see the ARP packet enter, get processed and the proper functions call to transmit it, but no packet comes out? Any advice?

My setup:
I set up a point to point network from my laptop’s wired ethernet to the Nucleo with the linux command:
sudo ip a add 192.168.2.1/24 dev enx00e04c680286
Note, there’s nothing else on that network, just the cable between the laptop and the dev board. I see the green link light, and the orange light is steady on, except for going dark about once per second when pinging:
ping 192.168.2.2

Here’s my config from main.c:

static uint8_t ucMACAddress[ 6 ] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
static const uint8_t ucIPAddress[ 4 ]        = { 192, 168,   2,   2 };
static const uint8_t ucNetMask[ 4 ]          = { 255, 255, 255,   0 };
static const uint8_t ucGatewayAddress[ 4 ]   = { 192, 168,   2,   1 };
static const uint8_t ucDNSServerAddress[ 4 ] = { 208,  67, 222, 222 };
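
For a point-to-point link like this one, the board and the laptop must share a subnet under the configured netmask. A minimal host-side sketch of that check follows; `same_subnet` is a hypothetical helper written for illustration and is not part of FreeRTOS+TCP.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when two IPv4 addresses (4-byte arrays, most significant octet first)
 * fall in the same subnet under the given netmask. Hypothetical helper. */
static bool same_subnet(const uint8_t a[4], const uint8_t b[4], const uint8_t mask[4])
{
    for (int i = 0; i < 4; i++) {
        if ((a[i] & mask[i]) != (b[i] & mask[i])) {
            return false;
        }
    }
    return true;
}
```

With the values above, 192.168.2.2 and 192.168.2.1 pass the check under 255.255.255.0, while the DNS server address 208.67.222.222 does not, so DNS traffic would be routed via the gateway.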

When I ping the ethernet on my Nucleo, I can see the request come through with wireshark, and I see the packet get processed inside FreeRTOS, all the way up through the point where it should be re-transmitted. However, the ping ends with “host unreachable,” and I don’t see a packet transmitted from the dev board with wireshark.

Here is an outline of key points I see stepping through the code on the ping receive. However, after the WRITE_REG, no packet is visible on wireshark :frowning: Any advice?

prvProcessEthernetPacket()
    switch( pxEthernetHeader->usFrameType )
        case ipARP_FRAME_TYPE:
            eReturned = eARPProcessPacket( ipCAST_PTR_TO_TYPE_PTR( ARPPacket_t, pxNetworkBuffer->pucEthernetBuffer ) );
                ( void ) memcpy( pvCopyDest, pvCopySource, sizeof( ulSenderProtocolAddress ) );
                uint32_t ulHostEndianProtocolAddr = FreeRTOS_ntohl( ulSenderProtocolAddress );
                traceARP_PACKET_RECEIVED();
                iptraceSENDING_ARP_REPLY( ulSenderProtocolAddress );
                eReturn = eReturnEthernetFrame;
    case eReturnEthernetFrame:

        vReturnEthernetFrame( pxNetworkBuffer, pdTRUE );
            case ipARP_FRAME_TYPE:
                pvCopySource = &pxEthernetHeader->xSourceAddress;

            ( void ) xNetworkInterfaceOutput( pxNetworkBuffer, xReleaseAfterSend );

                if( HAL_ETH_Transmit_IT( &( xEthHandle ), &( xTxConfig ) ) == HAL_OK )
                    if( ETH_Prepare_Tx_Descriptors( heth, pTxConfig, 1 ) != HAL_ETH_ERROR_NONE )
                    WRITE_REG( heth->Instance->DMACTDTPR, ( uint32_t ) ( heth->TxDescList.TxDesc[ heth->TxDescList.CurTxDesc ] ) );
                    xResult = pdPASS;

Thank you.

Hi @diydsp

Which part in the switch statement does the code get to? Also, are you able to provide the wireshark pcap of the packets? There are instructions on how to do this here. I’ve contacted our TCP experts, they’ll chime in soon too.

Hello @diydsp,

Welcome to FreeRTOS forums.

I see that everything seems to work properly in the code. The trace that you showed seems to be correct. Can you also try sending some packets out of the device on your own (see sending a ping using freertos)?

Also, a wireshark log would be much appreciated :slight_smile:.

Thanks

Thank you @diydsp for this well-prepared and very clear report. I cloned your repo and I will try to find where things go wrong. One moment please.

Hi all, thank you for responding so quickly :slight_smile: I spent more time on it and compared it to the example found here for Lwip: (Found at the bottom of this page)

I found a few issues. The first is embarrassing: I had been editing the RAM version of the linker file, not the FLASH one. For some reason, I thought the RAM one held the RAM-related settings; that was just a mistake. So when I said I had verified the DMA buffers, I was wrong. I have fixed that now.

The other big difference I found was the default GPIO config for the Eth interface in STMCubeMX was slightly off for the H745ZI-Q board! The pin selections for TXD0 and… get this: TX_EN were off! That’s why I couldn’t transmit! The transmit_enable was not enabled. So I matched my configuration to the one found in the Lwip reference project. The other difference is that those default GPIOs were also set for low-speed, so I moved them all to very high-speed, like the Lwip proj.

Another difference I found between the two: the comments in NetworkInterface.c say that the DMA descriptors need to be placed at 0x24040000, but the Lwip project places them at 0x30040000, and my project works at 0x30040000. I haven’t tried to move them back to 0x24040000, but from what I glanced at in the manual (caution: 3.5k-page PDF) on page 136, it seems 0x24040000 is in the special AXI SRAM, and it didn’t appear that Ethernet was connected to that except through an AHB-AXI bridge?! I don’t know this part well.
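
To make the two candidate addresses concrete, here is a small sketch that classifies an address by SRAM region. It assumes the addresses meant are 0x24040000 and 0x30040000, and it assumes the usual STM32H745 map of 512 KB AXI SRAM at 0x24000000 and 288 KB of D2 SRAM (SRAM1 to SRAM3) starting at 0x30000000; please verify both against the reference manual. `region_of` is a hypothetical helper.

```c
#include <stdint.h>

typedef enum { REGION_UNKNOWN, REGION_AXI_SRAM, REGION_D2_SRAM } sram_region_t;

/* Assumed memory map, to be checked against the reference manual. */
#define AXI_SRAM_BASE 0x24000000u
#define AXI_SRAM_SIZE (512u * 1024u)
#define D2_SRAM_BASE  0x30000000u
#define D2_SRAM_SIZE  (288u * 1024u)

/* Classify an address as AXI SRAM (D1 domain), D2 SRAM, or neither. */
static sram_region_t region_of(uint32_t addr)
{
    if (addr >= AXI_SRAM_BASE && addr - AXI_SRAM_BASE < AXI_SRAM_SIZE) {
        return REGION_AXI_SRAM;
    }
    if (addr >= D2_SRAM_BASE && addr - D2_SRAM_BASE < D2_SRAM_SIZE) {
        return REGION_D2_SRAM;
    }
    return REGION_UNKNOWN;
}
```

Under those assumptions 0x24040000 sits 256 KB into the AXI SRAM, and 0x30040000 sits 256 KB into the D2 SRAM, which is the start of SRAM3.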

And now it’s working, woo-hoo! I can ping! My average ping time is about 0.7ms. Does that sound about right? I was hoping it would be faster to move between my laptop and dev board so if there’s anything I’m missing to speed it up, let me know. Also I updated my repo.

My next task is to move some TCP data around, so I’m building SimpleTCPEchoServer.c into my code. So far it freezes with iptraceFAILED_TO_OBTAIN_NETWORK_BUFFER when pvPortMalloc fails on its second allocation… So I’ll be looking into that next.

Hello,

I am glad that you were able to figure out the issue and fix it! :slight_smile:. Yes, GPIO configuration is a big part of switching/porting to a new hardware.

0.7 ms sounds fine to me. When I hardwire my laptop with my router, I get 0.6-0.8 ms time to ping.

I would step through the code to see how much is the initial heap space and how much is left when malloc fails. If the heap space is running out too soon, is there a memory leak?

Hello @diydsp, you’re a hero! Very good that you found the problem yourself. I had already seen that your linker file (.LD) was missing the Ethernet section.

I had put the section .ethernet_data in AXI RAM:

  /* .ethernet_data declared here. */
  AXI_RAM (xrw)  : ORIGIN = 0x24000000, LENGTH = 512K

  .ethernet_data :
  {
    PROVIDE_HIDDEN (__ethernet_data_start = .);
    KEEP (*(SORT(.ethernet_data.*)))
    KEEP (*(.ethernet_data*))
    PROVIDE_HIDDEN (__ethernet_data_end = .);
  } >AXI_RAM

Running ICMP/ping doesn’t have a high priority within a host like Windows or Linux. When I ping between two FreeRTOS+TCP devices, the measured time is only a quarter of what you measured. The time will even drop further when I omit the logging.

Wait until you get iperf3 running: you will be amazed about its TCP performance on a 100 Mbps LAN.

I played a lot with the memory configuration like ETH_RX_DESC_CNT and ETH_TX_DESC_CNT.

and with the configuration for the TCP iperf sockets:

#define ipconfigIPERF_TX_BUFSIZE     ( 24 * ipconfigTCP_MSS )
#define ipconfigIPERF_TX_WINSIZE     ( 12 )
#define ipconfigIPERF_RX_BUFSIZE     ( 24 * ipconfigTCP_MSS )
#define ipconfigIPERF_RX_WINSIZE     ( 12 )

The above settings are good if you want to win the race. Normally you will assign less memory for a little less performance.
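
To see what those settings cost in RAM, here is a back-of-the-envelope sketch. It assumes ipconfigTCP_MSS is 1460 (the MSS that iperf3 reports later in this thread) and that the window sizes are counted in MSS-sized segments; the macro and function names below are local to the sketch, not the library's.

```c
#include <stddef.h>

#define TCP_MSS          1460u             /* assumed value of ipconfigTCP_MSS */
#define IPERF_TX_BUFSIZE (24u * TCP_MSS)
#define IPERF_RX_BUFSIZE (24u * TCP_MSS)
#define IPERF_TX_WINSIZE 12u               /* assumed unit: MSS-sized segments */

/* Bytes of stream buffer requested per iperf TCP socket (TX plus RX). */
static size_t iperf_socket_buffer_bytes(void)
{
    return (size_t)IPERF_TX_BUFSIZE + (size_t)IPERF_RX_BUFSIZE;
}

/* Bytes covered by a sliding window of the given number of segments. */
static size_t window_bytes(unsigned segments)
{
    return (size_t)segments * TCP_MSS;
}
```

That comes to 35040 bytes per direction, or 70080 bytes per socket, with a 17520-byte window. If the stream buffers come out of the FreeRTOS heap, that is more than half of a 120000-byte configTOTAL_HEAP_SIZE for a single connection.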

If it is interesting to you, find here my STM32H7 project. You find an iperf3 server here

PS. Could you mark your post as “solved” please?

EDIT
If you have a native Linux machine, i.e. not a Linux VM, you can try “sudo ping -f 192.168.2.2”.
This “flood ping” will exchange as many pings as possible and report a lost-count. This can be useful when testing a new configuration.


Hiya,
I’ve now got iperf3 running from my laptop to my STM32H745ZI-Q Nucleo. Wow this is super fast! Yay! Great work :slight_smile:

My numbers are below. I’m using @htibosch’s suggested WIN/BUFSIZEs. Would you say these are typical? BTW I switched the DMA DscrTabs back and forth between 0x24040000 and 0x30040000 and saw no difference in performance. What other levers are there to pull to optimize performance?

Also, can you tell me more about disabling the D-Cache? For the rest of my application I would strongly prefer to use the D-Cache. What is preventing it from being enabled? I can verify that turning it on stops pings, echo and iperf3 from working. How feasible is it to make this library work with the D-Cache enabled? Any plans for that to change in the near future? Are there some MPU settings to make it work?

Max reliable UDP speed = 14 Mbps, although I may be able to allocate more RAM and see if it can go faster. It eventually ends up at “iptraceFAILED_TO_OBTAIN_NETWORK_BUFFER”
Max TCP speed = 13.5 Mbps.

~/STM32CubeIDE/workspace_2_0/test_04$ iperf3 -V -c 192.168.1.10 --port 5001 --bytes 20M
iperf 3.7
Linux nvawter-tempo 5.11.0-46-generic #51~20.04.1-Ubuntu SMP Fri Jan 7 06:51:40 UTC 2022 x86_64
Control connection MSS 1460
Time: Tue, 25 Jan 2022 20:39:02 GMT
Connecting to host 192.168.1.10, port 5001
Cookie: ijgtor2raev66zfmjdadivea7guawxc3pr63
TCP MSS: 1460 (default)
[ 5] local 192.168.1.1 port 46348 connected to 192.168.1.10 port 5001
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 20971520 bytes to send, tos 0
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.78 MBytes 15.0 Mbits/sec 0 47.1 KBytes
[ 5] 1.00-2.00 sec 1.50 MBytes 12.5 Mbits/sec 0 47.1 KBytes
[ 5] 2.00-3.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 3.00-4.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 4.00-5.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 5.00-6.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 6.00-7.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 7.00-8.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 8.00-9.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 9.00-10.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 10.00-11.00 sec 1.66 MBytes 13.9 Mbits/sec 0 47.1 KBytes
[ 5] 11.00-12.00 sec 1.58 MBytes 13.2 Mbits/sec 0 47.1 KBytes
[ 5] 12.00-12.53 sec 898 KBytes 13.9 Mbits/sec 0 47.1 KBytes


Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-12.53 sec 20.0 MBytes 13.4 Mbits/sec 0 sender
[ 5] 0.00-12.53 sec 19.8 MBytes 13.3 Mbits/sec receiver
CPU Utilization: local/sender 0.2% (0.0%u/0.2%s), remote/receiver 0.0% (0.0%u/0.0%s)
snd_tcp_congestion cubic

iperf Done.

~/STM32CubeIDE/workspace_2_0/test_04$ iperf3 -V -c 192.168.1.10 --port 5001 --bytes 20M -u -b 14M
iperf 3.7
Linux nvawter-tempo 5.11.0-46-generic #51~20.04.1-Ubuntu SMP Fri Jan 7 06:51:40 UTC 2022 x86_64
Control connection MSS 1460
Setting UDP block size to 1460
Time: Tue, 25 Jan 2022 20:48:17 GMT
Connecting to host 192.168.1.10, port 5001
Cookie: kzdygqjqm7ed47szikcb3vuoiysbf3qfnlqs
Target Bitrate: 14000000
[ 5] local 192.168.1.1 port 36028 connected to 192.168.1.10 port 5001
Starting Test: protocol: UDP, 1 streams, 1460 byte blocks, omitting 0 seconds, 20971520 bytes to send, tos 0
[ ID] Interval Transfer Bitrate Total Datagrams
[ 5] 0.00-1.00 sec 1.67 MBytes 14.0 Mbits/sec 1198
[ 5] 1.00-2.00 sec 1.67 MBytes 14.0 Mbits/sec 1199
[ 5] 2.00-3.00 sec 1.67 MBytes 14.0 Mbits/sec 1198
[ 5] 3.00-4.00 sec 1.67 MBytes 14.0 Mbits/sec 1199
[ 5] 4.00-5.00 sec 1.67 MBytes 14.0 Mbits/sec 1199
[ 5] 5.00-6.00 sec 1.67 MBytes 14.0 Mbits/sec 1198
[ 5] 6.00-7.00 sec 1.67 MBytes 14.0 Mbits/sec 1199
[ 5] 7.00-8.00 sec 1.67 MBytes 14.0 Mbits/sec 1198
[ 5] 8.00-9.00 sec 1.67 MBytes 14.0 Mbits/sec 1199
[ 5] 9.00-10.00 sec 1.67 MBytes 14.0 Mbits/sec 1199
[ 5] 10.00-11.00 sec 1.67 MBytes 14.0 Mbits/sec 1198
^C[ 5] 11.98-21.70 sec 0.00 Bytes 0.00 bits/sec 0


Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-21.70 sec 20.0 MBytes 7.73 Mbits/sec 0.000 ms 0/14365 (0%) sender
[ 5] 0.00-21.70 sec 0.00 Bytes 0.00 bits/sec 0.000 ms 0/0 (0%) receiver
CPU Utilization: local/sender 46.3% (9.6%u/36.7%s), remote/receiver 0.0% (0.0%u/0.0%s)
iperf3: interrupt - the client has terminated
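
As a sanity check on the summaries above, the reported bitrates can be reproduced from the byte counts. A minimal sketch, assuming iperf3 counts a Mbit as 10^6 bits; `mbits_per_sec` is a hypothetical helper.

```c
#include <stdint.h>

/* Convert a byte count and a duration in seconds into Mbit/s (10^6 bits). */
static double mbits_per_sec(uint64_t bytes, double seconds)
{
    return (double)bytes * 8.0 / seconds / 1e6;
}
```

For the TCP run, 20971520 bytes in 12.53 s gives about 13.4 Mbit/s, matching the sender line. For the UDP run, 1198 datagrams of 1460 bytes in one second gives just under 14.0 Mbit/s, matching the per-second lines.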

Hi @diydsp, thanks for reporting back.

You run the iperf3 client on Linux. Is that a genuine Linux computer, or a virtual machine?
I ask this because iperf should show much higher numbers. I just had these speeds:

C:\> iperf3 -c 192.168.2.114 --port 5001 --bytes 100M
Connecting to host 192.168.2.114, port 5001
[  4] local 192.168.2.16 port 3885 connected to 192.168.2.114 port 5001
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.01   sec  9.50 MBytes  79.1 Mbits/sec
[  4]   1.01-2.00   sec  10.2 MBytes  86.6 Mbits/sec
[  4]   2.00-3.00   sec  10.5 MBytes  87.9 Mbits/sec
[  4]   3.00-4.00   sec  10.2 MBytes  86.1 Mbits/sec
[  4]   4.00-5.01   sec  10.4 MBytes  86.6 Mbits/sec

C:\> iperf3 -c 192.168.2.114 --port 5001 --bytes 100M -R
Connecting to host 192.168.2.114, port 5001
Reverse mode, remote host 192.168.2.114 is sending
[  4] local 192.168.2.16 port 7698 connected to 192.168.2.114 port 5001
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  10.6 MBytes  88.9 Mbits/sec
[  4]   1.00-2.00   sec  10.7 MBytes  90.0 Mbits/sec
[  4]   2.00-3.00   sec  11.0 MBytes  92.0 Mbits/sec
[  4]   3.00-4.00   sec  10.8 MBytes  90.7 Mbits/sec
[  4]   4.00-5.00   sec  11.0 MBytes  92.2 Mbits/sec

My Windows laptop is on 192.168.2.16.

Now I will check what changes it needs to allow data caching.

@diydsp, I found a way to use Ethernet while D-Cache is enabled.

First I used the traditional way of cleaning / invalidating individual cache-lines. That worked until I found out that the length of a DMA descriptor is not a multiple of a cache-line (24 instead of 32 bytes).
That makes it very difficult because there is always overlap.
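
The overlap can be shown with a little arithmetic. This sketch assumes descriptors packed back to back at 24 bytes each, starting at a cache-line-aligned address, with 32-byte cache lines; the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u
#define DESC_BYTES       24u   /* descriptor size as quoted above */

/* Index of the first cache line touched by descriptor `index`. */
static uint32_t first_line(uint32_t index)
{
    return (index * DESC_BYTES) / CACHE_LINE_BYTES;
}

/* Index of the last cache line touched by descriptor `index`. */
static uint32_t last_line(uint32_t index)
{
    return (index * DESC_BYTES + DESC_BYTES - 1u) / CACHE_LINE_BYTES;
}

/* True when descriptors a and b touch at least one common cache line. */
static bool descriptors_share_line(uint32_t a, uint32_t b)
{
    return first_line(a) <= last_line(b) && first_line(b) <= last_line(a);
}
```

Descriptor 0 occupies bytes 0 to 23 and descriptor 1 occupies bytes 24 to 47, so both touch cache line 0. Cleaning or invalidating that line for one descriptor therefore also affects its neighbour, which the DMA may be updating at the same time.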

Then I looked up how to disable D-Cache for a specific region of memory. Here is what I did: stm32H7_disable_dcache.c (1.6 KB)
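
The attached file is not reproduced here. As background only: on a Cortex-M7 the usual way to exclude a memory range from data caching is an MPU region marked non-cacheable, and the ARMv7-M MPU requires the region size to be a power of two of at least 32 bytes, with the base address aligned to that size. The sketch below checks those constraints and computes the SIZE field encoding (region size = 2^(SIZE+1)); the helper names are hypothetical, and the 32 KB region at 0x30040000 is only an example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return the ARMv7-M MPU SIZE field for a region of `bytes`,
 * or -1 when the size is not a power of two of at least 32 bytes. */
static int mpu_size_field(uint32_t bytes)
{
    if (bytes < 32u || (bytes & (bytes - 1u)) != 0u) {
        return -1;
    }
    int shift = 0;
    while ((bytes >> shift) != 1u) {
        shift++;                 /* shift ends up as log2(bytes) */
    }
    return shift - 1;            /* region size = 2^(SIZE + 1) */
}

/* True when `base` and `bytes` describe a region the MPU can express. */
static bool mpu_region_ok(uint32_t base, uint32_t bytes)
{
    return mpu_size_field(bytes) >= 0 && (base & (bytes - 1u)) == 0u;
}
```

For example, a 32 KB region encodes as SIZE = 14, and a base of 0x30040000 is correctly aligned for it.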

When testing: remember to disable all debug features that affect speed, such as stack checking, and use compiler optimisation such as -Os or -O2.

I am curious if it also works for you.

Hi, wow, 80-90 Mbps: astonishing! I tried it on a peppy Win10 box and got the same results as on my Linux machine. Yes, it is a real machine, not a VM, and the Win10 box was also a real one.

Then I saw your post below and remembered optimization… I was running with -O0… so I increased it to -O1, -O2, -O3, and -Ofast. That was helpful - the fastest I can reach now is 35 Mbps. And that’s the same for win and linux and for two different USB-C-based ethernet interfaces.

So that’s pretty darn impressive… but if anyone has advice on where to look to make it as awesome as @htibosch 's system, I would appreciate it. I turned on link-time optimization, but that didn’t change anything… Perhaps something in the FreeRTOS config? Any suggestions there? I’m going to import @htibosch 's STM32H747_cube project now and scan through the settings there.

Mainly I’m needing to reduce latency to the absolute minimum. I’m assuming that when I have the highest throughput, my build/configuration will be correct and applicable to smallest latency… but that is an assumption. Also, has anyone got any suggestions for how to test UDP latency?
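
One simple way to measure UDP latency from the laptop side is to time a round trip against an echo service. The sketch below is for a POSIX host and assumes that the board runs a UDP echo task that returns each datagram to its sender; that task, its port, and the helper names `udp_open` and `udp_rtt_us` are all assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Open a UDP socket with a 1 s receive timeout; returns -1 on failure. */
static int udp_open(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };

    if (sock >= 0) {
        (void)setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    }
    return sock;
}

/* Send one datagram to `peer`, wait for one datagram back, and return the
 * elapsed time in microseconds, or -1 on send error or receive timeout. */
static long udp_rtt_us(int sock, const struct sockaddr_in *peer,
                       const void *payload, size_t len)
{
    char reply[1500];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (sendto(sock, payload, len, 0,
               (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
        return -1;
    }
    if (recvfrom(sock, reply, sizeof(reply), 0, NULL, NULL) < 0) {
        return -1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (long)(t1.tv_sec - t0.tv_sec) * 1000000L
         + (t1.tv_nsec - t0.tv_nsec) / 1000L;
}
```

Calling `udp_rtt_us` in a loop and recording the minimum and the spread gives a rough latency figure. It includes the host's own network stack and scheduling delays, so it is an upper bound on what the board contributes.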

(p.s. the D-Cache stuff is amazing, too. I’ll dig into that after I skim through the STM32H747_cube_second.ioc)

Between my laptop and the board there is a 1 Gbps switch. Between the switch and the board, the speed is 100 Mbps. Not sure about your connection?

Note that running WireShark on your laptop may slow down Ethernet communication.

In this directory, you find the configuration files that I used during the test.

Here is the latest Ethernet driver for STM32H7xx. Maybe there was a small change with a big effect.

And finally, these are the symbols that I defined for the compiler:

USE_HAL_DRIVER
STM32H747xx
STM32H7xx
CORE_CM7
DEBUG
ipconfigUSE_TCP_MEM_STATS=0
STATIC_LOG_MEMORY=1
SYMBOLS_USED=1