TCP socket dying

friesen wrote on Wednesday, August 22, 2018:

I am using FreeRTOS+TCP with a wilc1000 WiFi module. I find that in certain situations, when network connectivity is lost (in this case by doing an SSID scan), a connected socket dies and eventually gives me a pdFREERTOS_ERRNO_ENOTCONN. I suspect that the network is down for around 500 ms. While I’ll have to work on fixing that issue, I’d really like to understand why a socket would quit from this. Is there some setting in FreeRTOSIPConfig.h that I have overlooked? I’ll dig deeper, but my first guess is that all the retries are finishing too soon.

friesen wrote on Wednesday, August 22, 2018:

I am using windowing. I have attached my config.

heinbali01 wrote on Thursday, August 23, 2018:

Hi Erik, I don’t know who has provided a FreeRTOS+TCP driver for your WiFi chip and what it does with the TCP/IP stack in case it does an SSID scan, or in case the connection gets lost?

Looking at your FreeRTOSIPConfig.h, I see that you are using TCP keep-alive messages:

/* Include support for TCP keep-alive messages. */
#define ipconfigTCP_KEEP_ALIVE                   ( 1 )
#define ipconfigTCP_KEEP_ALIVE_INTERVAL          ( 20 ) /* in seconds */

When using keep-alive messages, a dummy packet is sent every 20 seconds. If it goes unanswered three times in a row, you can expect an error when using the socket: -pdFREERTOS_ERRNO_ENOTCONN.
You can of course make this time longer.

In fact the timing is like this:

    0 sec packet-1  ...  no reply
   20 sec packet-2  ...  no reply
   23 sec packet-3  ...  no reply
   26 sec "connection closed"

In my +TCP drivers I tend to call the following function only once, e.g. right after a Link Status has been detected:

    vApplicationIPNetworkEventHook( eNetworkUp );

I never call it with eNetworkDown, even when the Link Status of the connection goes down.
So unless there is a long disconnection, longer than 26 seconds, the TCP connections will continue to exist.
As soon as the Link Status becomes high again, the communication will resume.

As long as the Link Status is low, all outgoing packets passed to xNetworkInterfaceOutput() must be discarded, so that function should test if it is OK to send a packet.
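
The link-status check described above could look roughly like the following. This is only a host-side sketch: `SketchPacket_t`, `xSketchLinkUp`, `prvSketchHardwareSend()` and `xSketchInterfaceOutput()` are hypothetical names invented for illustration, and a real driver would implement the same test inside its own xNetworkInterfaceOutput().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the driver's packet descriptor (hypothetical); a real
 * +TCP driver works with the stack's own network buffer type instead. */
typedef struct
{
    const uint8_t *pucData;
    size_t xLength;
} SketchPacket_t;

/* Hypothetical link flag, assumed to be updated by the WiFi driver
 * whenever the Link Status changes. */
static volatile bool xSketchLinkUp = false;

/* Hypothetical counters, handy for logging how often packets are dropped. */
static uint32_t ulSketchSent = 0;
static uint32_t ulSketchDropped = 0;

/* Hypothetical hardware transmit routine; here it only counts packets. */
static void prvSketchHardwareSend( const SketchPacket_t *pxPacket )
{
    ( void ) pxPacket;
    ulSketchSent++;
}

/* Models the test the output function should perform: transmit while the
 * link is up, silently discard otherwise.
 * Returns true if the packet was handed to the hardware. */
bool xSketchInterfaceOutput( const SketchPacket_t *pxPacket )
{
    if( !xSketchLinkUp )
    {
        ulSketchDropped++;
        return false;
    }

    prvSketchHardwareSend( pxPacket );
    return true;
}
```

In a real driver the discarded packet's buffer presumably still has to be released in the same way as after a successful send, otherwise the stack would run out of network buffers during a long link outage.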

I hope that this makes it clear?

friesen wrote on Wednesday, September 26, 2018:

I have used the ASF framework, and used the Amazon Freertos + tcp example, and built my own combination, which includes wireshark to usb stick.

The whole thing works perfect in when it has a good connection, but its almost like the ip stack chokes if it gets a bit behind on tcp processing, I’m not sure. I’m not even sure how/where to log this properly.

To clarify a bit on the type of connection, it is a mp3 stream, so the tcp keep alives probably aren’t even a factor here as the stream source never quits.

Other concurrent tcp sockets stay alive for the duration.

friesen wrote on Wednesday, September 26, 2018:

I am on version 2.0.1, I do see some stuff in the changelog that may be pertinent.

heinbali01 wrote on Thursday, September 27, 2018:

which includes wireshark to usb stick.

What do you mean by this? Do you mean that your application writes PCAP’s to a USB stick?

The whole thing works perfect in when it has a good connection,
but its almost like the ip stack chokes if it gets a bit behind
on tcp processing,

As it’s an audio stream, I would recommend using an RX buffer size that is big enough to hold at least 5 seconds of audio ( preferably more ). I hope you have enough RAM for that. Before you start playing a track, wait until the buffer is more than 80% full.
In FreeRTOS_Sockets.h you find some functions that allow you to inspect the buffer level, like FreeRTOS_recvcount().

Once playing, as a test, I would keep on monitoring the RX buffer: see if it remains between acceptable limits.
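
The "wait until the buffer is 80% full" rule could be expressed as a small helper. This is a minimal sketch: `xSketchBufferReady()` is a hypothetical name, and the byte count passed in is assumed to come from a call such as FreeRTOS_recvcount() on the streaming socket.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true once at least ulPercent of the buffer is filled.
 * uxBytesBuffered: bytes currently waiting in the RX buffer.
 * uxBufferSize:    total size of that buffer ( e.g. lRxBufSize ).
 * 64-bit arithmetic is used so the multiplication cannot overflow. */
bool xSketchBufferReady( size_t uxBytesBuffered,
                         size_t uxBufferSize,
                         unsigned ulPercent )
{
    if( uxBufferSize == 0 )
    {
        return false;
    }

    return ( ( uint64_t ) uxBytesBuffered * 100u ) >=
           ( ( uint64_t ) uxBufferSize * ulPercent );
}
```

The player task would poll this before starting a track, and could use the same helper with a lower percentage while playing to detect that the buffer is running dry.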

I’m not sure. I’m not even sure how/where to log this properly.

What you can do is produce a PCAP, compress it and attach it to your post. Make sure that the packets are filtered, with e.g. ip.addr==192.168.2.100.
Before doing so, you may want to play with the RX buffer size and monitor it.

…it is a mp3 stream, so the tcp keep alives probably aren’t
even a factor here as the stream source never quits.

That is true: as long as a socket receives data of any kind, it is assumed that the peer is still alive, and no keep-alive messages are being sent.

I would like to see your FreeRTOSIPConfig.h, and see how you configure the socket that transports MP3 data.
Have you used FREERTOS_SO_WIN_PROPERTIES to set its buffer properties?

friesen wrote on Thursday, September 27, 2018:

That is right, I can write PCAP’s to a USB drive.

It would appear that 2.0.7 is doing better.

I am setting this socket up with

    WinProperties_t xWinProps = {
        .lRxBufSize = 131072,
        .lRxWinSize = 32,
        .lTxBufSize = 5000,
        .lTxWinSize = 2
    };

    if (FreeRTOS_setsockopt(hTCP, 0, FREERTOS_SO_WIN_PROPERTIES, &xWinProps, sizeof(xWinProps))) {
        SysLog(LOG_DEBUG, "PL: set FREERTOS_SO_WIN_PROPERTIES failure\r\n");
        UpdatePlayState("Sys error 2");
        playtask = player_error;
        return;
    }

I have plenty of memory here, so it is more about setting the optimal values for tcp. Would it help recovery to have a larger window?

The audio buffer is a couple minutes large.

I have a capture here aercon.net/Public/StreamSample.pcap

That example is what the wifi interface sees. This test was with very poor signal, the antenna off the router.

heinbali01 wrote on Friday, September 28, 2018:

Hi Erik,

That is right, I can write PCAP’s to a USB drive.

Very good, it can come in handy. I sometimes write a PCAP to a RAM-disk.

It would appear that 2.0.7 is doing better.

It has a lot of improvements, mostly on security and boundary checks.
For instance, there is a better checking of the correctness of DHCP or DNS packets.

I am setting this socket up with

> WinProperties_t  xWinProps = {
>     .lRxBufSize = 131072,
>     .lRxWinSize = 32,
>     .lTxBufSize = 5000,
>     .lTxWinSize = 2
> };

When your MP3 audio has 320 Kbps, and if you want 10 seconds of buffer space, you will need 400 KByte of buffering.
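
The arithmetic behind that figure, as a small sketch ( `ulSketchStreamBufferBytes()` is a hypothetical helper name, and the bit rate is assumed to be constant ):

```c
#include <stdint.h>

/* Bytes needed to hold ulSeconds of audio at ulBitsPerSecond.
 * 320000 bits/s is 40000 bytes/s, so 10 seconds need 400000 bytes. */
uint32_t ulSketchStreamBufferBytes( uint32_t ulBitsPerSecond,
                                    uint32_t ulSeconds )
{
    return ( ulBitsPerSecond / 8u ) * ulSeconds;
}
```

For a 320 Kbps stream and 10 seconds this gives 400,000 bytes, which is where the "400 KByte" comes from.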

I have plenty of memory here, so it is more about setting the optimal values for tcp.

Lucky you! So yes, try a 10-second buffer or more, and before playing, wait until the buffer is filled about 80%.

Would it help recovery to have a larger window?

No.

An lRxWinSize of 32 means that you can have 32 outstanding packets ( of 1400 bytes each in your case ).
I would use less. On a fast LAN/WiFi and with fast CPU’s, you will have no more than 10 outstanding packets.

On a WAN, when the quality of an ISP is less than optimal, a value of 4 would be optimal.
FreeRTOS+TCP has a mechanism which will make the window very small when there are too many errors.

A large TCP window and a bad transporter = complete chaos :-) In the worst case, a window size of 2 packets is optimal.

I would propose these settings:

WinProperties_t  xWinProps = {
	.lRxBufSize = 409600,				// 10 seconds of MP3 audio
	.lRxWinSize = 10,					// At most 10 outstanding packets
	.lTxBufSize = 2 * ipconfigTCP_MSS,	// There is no returning data
	.lTxWinSize = 2
};

The audio buffer is a couple minutes large.

So much won’t be necessary. The socket is already buffering audio data. Why not reserve 409,600 bytes as well?

If I were you I would produce some statistics of the buffer spaces while playing audio. Every second, write one record of how full both buffers are.
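
One possible shape for such a once-per-second record is a CSV line with the fill level of both buffers as a percentage. This is only a sketch: `lSketchFormatStats()` and the record layout are assumptions made for illustration, and the byte counts are expected to come from the socket ( e.g. FreeRTOS_recvcount() ) and from the application's own audio buffer.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats one record as "second,socket%,audio%".
 * Returns the value of snprintf(), i.e. the length of the record. */
int lSketchFormatStats( char *pcOut, size_t uxOutSize,
                        unsigned long ulSecond,
                        unsigned long ulSocketBytes, unsigned long ulSocketSize,
                        unsigned long ulAudioBytes, unsigned long ulAudioSize )
{
    unsigned long ulSocketPct = 0;
    unsigned long ulAudioPct = 0;

    /* Guard against division by zero; widen before multiplying so that
     * large audio buffers do not overflow a 32-bit unsigned long. */
    if( ulSocketSize != 0 )
    {
        ulSocketPct = ( unsigned long )
            ( ( ( unsigned long long ) ulSocketBytes * 100u ) / ulSocketSize );
    }

    if( ulAudioSize != 0 )
    {
        ulAudioPct = ( unsigned long )
            ( ( ( unsigned long long ) ulAudioBytes * 100u ) / ulAudioSize );
    }

    return snprintf( pcOut, uxOutSize, "%lu,%lu,%lu",
                     ulSecond, ulSocketPct, ulAudioPct );
}
```

The resulting lines can go to the same log or USB stick as the PCAP, which makes it easy to line up a drop in buffer level with the delays seen in the capture.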

I have a capture here http://aercon.net/Public/StreamSample.pcap

Thanks. For readability, I truncated it to a smaller file ( record 23150 and further ), attached below. ( as pcap_erik.zip )

I guess you have a very fast CPU!

You see that most of the time the actual TCP window is only filled with 1 segment?

There are some retransmissions and delays of up to 20 seconds.

Packet 11 is out of order. It had already been received and acknowledged.

Packet 15 one packet is missing:
          expected  9801
          received 11201
          The response is correct: ACK=9801 SLE=11201 SRE=12601

After packet, packets are coming in very slowly: delays of 6 and 20 seconds

Packet 21 is the expected retransmission of 9801, good

Packet 25 two packets are missing:
          expected 16801
          received 30801
          The response is correct: ACK=16801 SLE=30801 SRE=32201

Packet 21 is the expected retransmission of 16801, good

Packet 27 a retransmission comes after a delay of 20.7 seconds

Packet 29 thirteen packets are missing:
          expected 18201
          received 36401 ( missing 13 packets )
          The response is correct: ACK=18201 SLE=36401 SRE=37801

Packet 52 TCP has fully recovered 

Packet 99 a long awaited packet is retransmitted after 20 seconds

Packet 101 After 30 seconds of silence, +TCP is giving up on this connection

Summary: nothing wrong with the +TCP responses. The big delays cause a problem: delays of 20 up to 30 seconds.
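
The size of a gap can be read from the ACK and SLE values quoted in the notes above. A minimal sketch, assuming every segment carries 1400 bytes as in this capture ( `ulSketchMissingSegments()` is a hypothetical helper name ):

```c
#include <stdint.h>

/* Number of segments missing between the cumulative ACK and the left
 * edge of the first SACK block ( SLE ). Unsigned subtraction keeps the
 * result correct when the sequence numbers wrap around. */
uint32_t ulSketchMissingSegments( uint32_t ulAck,
                                  uint32_t ulSackLeftEdge,
                                  uint32_t ulSegmentSize )
{
    uint32_t ulGapBytes;

    if( ulSegmentSize == 0 )
    {
        return 0;
    }

    ulGapBytes = ulSackLeftEdge - ulAck;

    /* Round up so that a partly filled last segment still counts. */
    return ( ulGapBytes + ulSegmentSize - 1u ) / ulSegmentSize;
}
```

For packet 15 ( ACK=9801, SLE=11201 ) this yields 1 missing segment, and for packet 29 ( ACK=18201, SLE=36401 ) it yields 13.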

That example is what the wifi interface sees. This test was with
very poor signal, the antenna off the router.

Ah, that is why.

It looks like the WiFi connection got lost sometimes and needed to be re-established, which takes about 20 seconds.