FreeRTOS+TCP performance issues with LPC1768

sdcandy wrote on Friday, November 02, 2018:

Hello,

I am having some performance and reliability issues with the FreeRTOS+TCP stack on an NXP LPC1768.

I am using FreeRTOS v10.1.0 and the NetworkInterface.c file from the download here https://interactive.freertos.org/hc/en-us/community/posts/210030166-FreeRTOS-TCP-Labs-port-and-demo-for-Embedded-Artists-LPC4088-Dev-Kit which in turn requires LPCOpen v2.10.

FreeRTOS is configured with 8 priority levels. The Ethernet Rx Task is created as priority 7 (configMAX_PRIORITIES - 1) and the IP task then becomes priority 6.
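In outline ( ipconfigIP_TASK_PRIORITY is the real +TCP setting; the task name, stack-size constant and handle here are only illustrative ):

    /* FreeRTOSConfig.h */
    #define configMAX_PRIORITIES        8

    /* FreeRTOSIPConfig.h - the IP-task sits one priority below the EMAC task. */
    #define ipconfigIP_TASK_PRIORITY    ( configMAX_PRIORITIES - 2 )    /* 6 */

    /* NetworkInterface.c - the Ethernet Rx handler task runs at the top. */
    xTaskCreate( prvEMACHandlerTask, "EMAC", nwRX_TASK_STACK_SIZE,
                 NULL, configMAX_PRIORITIES - 1, &xRxHandlerTask );     /* 7 */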

The network stack is running and gets an IP address from a DHCP server. If I then ping the device from a PC elsewhere on the network I get a reply time of 0.3 ms.

I am trying to integrate the Minnow WebSockets server (and ultimately the SharkSSL security layer) from Real Time Logic. When I start the task that runs the WebSockets server (currently at priority 4), even when no browser is trying to connect, the ping response time goes out to between 3 and 5 seconds!

When a browser tries to make the initial connection to the server to get the static content (index.html, some Javascript files and images) the transfers all start to time out. When I look at the network traffic with Wireshark I see lots of reset and TCP retransmission entries.

With the debug output enabled in the driver, it seems that there are lots of receive interrupts occurring with no associated data.

I am at a loss as to how to debug this from here. I tried looking at the LPC17xx NetworkInterface.c file that ships with FreeRTOS v10.1.0, but it doesn’t appear to compile against the FreeRTOS+TCP and LPCOpen files.

-Andy.

rtel wrote on Friday, November 02, 2018:

I am trying to integrate the Minnow WebSockets server (and ultimately
the SharkSSL security layer) from Real Time Logic. When I start the task
that runs the WebSockets server (currently at priority 4), even when no
browser is trying to connect, the ping response time goes out to
between 3 and 5 seconds!

This seems to be the thing to understand first. If I understand your
description correctly, adding the above referenced library is the only
change made between getting a 0.3 ms ping reply and a 3 to 5 second ping
reply. That is so massively long that there must be some coarse
optimizations to be done to remove most of that latency before looking
in detail at any performance gains that might be made in the MAC driver
itself, or in the interface between the MAC and the TCP stack.

I’m not familiar with the library, but am going to guess it is doing
something that is not very multithreading friendly. Are you able to
trace the execution using something like Percepio Tracealyzer
( FreeRTOS+Trace ) or Segger SystemView? That
will help you understand where the time is being eaten up to enable a
more targeted debug effort.

sdcandy wrote on Sunday, November 04, 2018:

I’ve integrated the Percepio Tracealyzer code and captured some traces, and many of the early ones seemed to be filled with nothing more than the system tick. I have disabled the capture of that in the hope that it records more useful events instead.

I’m not sure exactly how to interpret what I am looking at, but the only thing I can see of any significant duration is listed in the NetEvnt queue, where the IP-task’s xQueueReceive blocks for 100 ms trying to receive, fails, and then blocks again for 900 ms, as shown in the attached image. This repeats over and over.

-Andy.

sdcandy wrote on Monday, November 05, 2018:

According to the post here, someone else has ported this driver to the LPC1768 and appears to have hit the same problems I am seeing.

I tried to contact them through the interactive site back in September, when I thought we were about to embark on this project, but they have not responded to my message.

-Andy.

sdcandy wrote on Monday, November 05, 2018:

I think I may have made some progress with this…

In the handler task that deals with received buffers from the hardware, the code checked whether the receive queue was empty using an if() and then processed only the first buffer. Changing it to a while(!empty) and therefore processing all the data seems to have got the ping response consistently down to 0.5 ms.
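In outline the change was just the following ( xRxQueueIsEmpty() and prvProcessReceivedBuffer() here stand in for the driver’s actual queue check and buffer handling ):

    /* Before: only the first pending buffer was handled per wake-up. */
    if( xRxQueueIsEmpty() == pdFALSE )
    {
        prvProcessReceivedBuffer();
    }

    /* After: drain every pending buffer before blocking again. */
    while( xRxQueueIsEmpty() == pdFALSE )
    {
        prvProcessReceivedBuffer();
    }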

-Andy.

heinbali01 wrote on Tuesday, November 06, 2018:

Hi Andy,

Changing it to a while(!empty) and therefore processing all the data

That’s indeed the usual way of working: the working task is woken up from the MAC interrupt ( either by a task notification or a semaphore ).
The working task checks both queues, reception and transmission, each in a while(!empty) loop.

While working on these queues, new notifications may come in. That is no problem: the next xTaskNotifyWait() will return immediately without blocking.
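A minimal sketch of that structure ( the prvRxQueueNotEmpty() / prvTxBufferSent() style helpers are placeholders for whatever the driver uses to check its DMA descriptor rings ):

    #include "FreeRTOS.h"
    #include "task.h"

    static void prvEMACHandlerTask( void *pvParameters )
    {
        uint32_t ulISREvents;

        ( void ) pvParameters;

        for( ;; )
        {
            /* Block until the MAC interrupt sends a notification
            ( with a time-out so the task can also do housekeeping ). */
            xTaskNotifyWait( 0UL, 0xFFFFFFFFUL, &ulISREvents, pdMS_TO_TICKS( 1000UL ) );

            /* Drain the reception descriptors completely. */
            while( prvRxQueueNotEmpty() != pdFALSE )
            {
                prvHandleReceivedFrame();
            }

            /* Release the buffers of all completed transmissions. */
            while( prvTxBufferSent() != pdFALSE )
            {
                prvReleaseSentBuffer();
            }
        }
    }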

seems to have got the ping response consistently down to 0.5ms

Great! If you have Linux, try sudo ping -f <address> to generate a continuous flood of ICMP packets.

Here are some items that determine the performance of an Ethernet driver:

  • Filtering unwanted traffic using MAC filters and hash filters.
  • Early filtering: packets that are accepted by the MAC can still be filtered by the driver: think of unwanted broadcast packets. It is a waste of time to pass them to the IP-stack if they’re not used later on ( see the sketch after this list ).
  • Consider using multicast instead of broadcast. I have seen networks that are littered with IPv4 broadcasts.
  • CRC offloading, unfortunately not available on the LPC1768
  • Use zero-copy methods wherever possible
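To illustrate the early-filtering item above, an RX path can discard unwanted frames before they ever reach the IP-task. This is only a sketch: eConsiderFrameForProcessing(), xSendEventStructToIPTask() and vReleaseNetworkBufferAndDescriptor() are existing +TCP functions, while the counters and xDescriptorWaitTime are placeholders:

    /* pxDescriptor points to a frame that the MAC / DMA has just received. */
    if( eConsiderFrameForProcessing( pxDescriptor->pucEthernetBuffer ) == eProcessBuffer )
    {
        IPStackEvent_t xRxEvent;

        xRxEvent.eEventType = eNetworkRxEvent;
        xRxEvent.pvData = ( void * ) pxDescriptor;

        if( xSendEventStructToIPTask( &xRxEvent, xDescriptorWaitTime ) == pdFAIL )
        {
            /* The IP-task's event queue was full: count it and drop. */
            ulRxEventsLost++;
            vReleaseNetworkBufferAndDescriptor( pxDescriptor );
        }
    }
    else
    {
        /* Not addressed to us, or an unwanted frame type: drop it here
        so the IP-task never sees it. */
        ulRxFramesFiltered++;
        vReleaseNetworkBufferAndDescriptor( pxDescriptor );
    }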

Normal optimisations:

  • Disable stack checking
  • Use a compiler optimisation level higher than -O0 ( e.g. -O2 )
  • Disable asserts ( if you dare )
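In terms of settings, the items above amount to something like the following ( compiler flags depend on your toolchain ):

    /* FreeRTOSConfig.h */
    #define configCHECK_FOR_STACK_OVERFLOW    0    /* disable run-time stack checking */

    /* Leave configASSERT() undefined, or define it as empty - only if you dare. */
    #define configASSERT( x )

    /* And build the project with compiler optimisation enabled, e.g. -O2. */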

You can also check the network performance with iperf3, see this post.

sdcandy wrote on Tuesday, November 06, 2018:

Things aren’t quite perfect.

A prolonged ping test (64 bytes) will run for several hours before the system locks up.

I have added a bunch of breakpoints to the error handlers in the Ethernet driver to see if it traps in any of those but it doesn’t look like it does. I’ll try and capture the trace buffer the next time it fails to see if that shows where it has gone.

-Andy.

heinbali01 wrote on Wednesday, November 07, 2018:

A prolonged ping test (64 bytes)

Are you using the ping flood ( -f ) option for this? Or is it just a slow ping?

run for several hours before the system locks up.

When it is locked up, can you pause the debugger and see where it is hanging?

Is the system totally locked up, or just extremely slow?
Can you show some heart-beat, e.g. an LED that blinks from a user task?

If you look at the NetworkInterface.c of e.g. STM32Fx, you will see that prvEMACHandlerTask() can issue some warnings:

    Network buffers: 4 lowest 2
    TX DMA buffers: lowest 0
    Queue space: lowest 34

It can be useful to monitor these resources: the number of free Network Buffers, the number of available DMA buffers, whether all DMA transmissions are successful, and whether there is a ( DMA ) reception overflow ( which can be nasty ).
It is also wise to monitor the free space on the heap, for instance as in the sketch below.
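For example, printed from the EMAC handler task every few seconds ( FreeRTOS_printf() must be defined in your FreeRTOSIPConfig.h, and xPortGetMinimumEverFreeHeapSize() is available with heap_4 but not with heap_3 ):

    FreeRTOS_printf( ( "Network buffers: %u lowest %u  Heap: %u lowest %u\n",
                       ( unsigned ) uxGetNumberOfFreeNetworkBuffers(),
                       ( unsigned ) uxGetMinimumFreeNetworkBuffers(),
                       ( unsigned ) xPortGetFreeHeapSize(),
                       ( unsigned ) xPortGetMinimumEverFreeHeapSize() ) );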

While writing this message, I had ping running, talking to a FreeRTOS+TCP device:

    root@ubuntu:~# ping -f 192.168.2.106
    PING 192.168.2.106 (192.168.2.106) 56(84) bytes of data.
    .^C 
    --- 192.168.2.106 ping statistics ---
    1397706 packets transmitted, 1397705 received, 0% packet loss, time 1094968ms
    rtt min/avg/max/mdev = 0.330/0.698/69.936/0.579 ms, pipe 4, ipg/ewma 0.783/0.655 ms

sdcandy wrote on Wednesday, November 07, 2018:

When it gives up, it appears to have hit the hard fault handler, but I haven’t worked out what caused it to get there.

The ping test was just done with a slow ping. I will repeat the test with the -f option and see what happens.

Looking again at the initial issue I raised regarding the Minnow Server: Wireshark shows the HTTP request coming from the browser and an ACK coming back from the device, but the device then doesn’t seem to send any of the data from the web server.

The support team at Real Time Logic seem to believe that there is a problem with the way that large packets are sent - possibly that the driver just gives up on them.

-Andy.

sdcandy wrote on Wednesday, November 07, 2018:

Running ping with -f shows just one . on the screen, then:

    PING 192.168.10.184 (192.168.10.184): 56 data bytes
    ..Request timeout for icmp_seq 3659
    ..Request timeout for icmp_seq 3835
    .Request timeout for icmp_seq 3836

Every request thereafter times out.

EDIT: I have removed the if (bReleaseAfterSend != pdFALSE) check from around the call to vReleaseNetworkBufferAndDescriptor(pxNetworkBuffer); and performance seems to be improving.

Flood ping test now gives:

    --- 192.168.10.184 ping statistics ---
    893181 packets transmitted, 893086 packets received, 0.0% packet loss
    round-trip min/avg/max/stddev = 0.298/0.444/14.706/0.055 ms

-Andy.

sdcandy wrote on Wednesday, November 07, 2018:

Not sure whether this makes any difference, but I am using heap_4.c and BufferAllocation_1.c.

EDIT: For other reasons, I have changed this to use heap_3.c and forced the heap into one of the 16KB AHBSRAM regions in the LPC1768, and the behaviour is much the same.

-Andy.

sdcandy wrote on Thursday, November 08, 2018:

I have added some of the warning messages from the STM32Fx driver and I see that the minimum number of free network buffers reported by uxGetMinimumFreeNetworkBuffers() does go to 0. Should the network driver be blocking somewhere to wait for one to free up?

-Andy.

heinbali01 wrote on Saturday, November 10, 2018:

When the number of available Network Buffers is getting close to zero, you have a serious problem.

Here are some simple strategies to investigate it:

I would try not to block on the availability, and instead just drop a packet.

Suppose your driver has received a packet and you want to copy it into a Network Buffer, but none is available: I would increase some error counter and drop the packet.

And for a zero-copy driver: a packet has arrived, but you do not have a new Network Buffer to replace it: leave the current Network Buffer assigned to DMA and drop the packet.

If xNetworkInterfaceOutput() is called and no DMA buffer becomes available within e.g. 10 ms, I would increase a counter and drop the packet.

At every place where packets are dropped, please update some counter so that you can study the behaviour!
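A sketch of the zero-copy RX case ( pxGetNetworkBufferWithDescriptor() is the stack’s allocation call; the prv...() helpers, the counter and the frame-size constant are placeholders ):

    NetworkBufferDescriptor_t *pxNewBuffer;

    /* Ask for a fresh buffer to give back to the DMA descriptor,
    without blocking. */
    pxNewBuffer = pxGetNetworkBufferWithDescriptor( ipTOTAL_ETHERNET_FRAME_SIZE, 0UL );

    if( pxNewBuffer != NULL )
    {
        /* Pass the received buffer up to the IP-task and attach the
        fresh buffer to the DMA descriptor. */
        prvPassFrameToIPTask( pxCurrentBuffer );
        prvAttachBufferToDMA( pxNewBuffer );
    }
    else
    {
        /* No Network Buffer available: leave the current buffer attached
        to the DMA descriptor, drop the packet, and count it. */
        ulRxPacketsDropped++;
    }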

UDP packets are stored in Network Buffers. It is important to have those packets read/consumed by the task that owns the socket. If not, you lose valuable Network Buffers. If the CPU is very busy, it may happen that such a task doesn’t get enough CPU time to process them.

( please note the macro ipconfigUDP_MAX_RX_PACKETS, which can help here )
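For example, in FreeRTOSIPConfig.h ( the value is just an illustration ):

    /* Limit how many received UDP packets may be queued on a single UDP
    socket; any further packets are dropped rather than tying up more
    Network Buffers. */
    #define ipconfigUDP_MAX_RX_PACKETS    6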

In principle, the IP-task should never block on getting resources. One exception is xNetworkInterfaceOutput(), which may wait a few ms for an available DMA buffer. But if it waits for a Network Buffer, in most cases, it will be waiting for itself.

In the end, you do not want a TCP/IP stack in which packets are dropped regularly. The above method is a way of studying the behaviour. Once all parameters are well tuned, packets are rarely dropped and there should always be enough Network Buffers.