FreeRTOS+TCP Multi: Issues with multiple endpoints on one interface

@tony-josi-aws

I am attaching the python tcp client (had to change the extension to upload)
tcp_client.py.c (891 Bytes)

I’ll take a look at how to make my logging thread safe.
In the meantime I can comment out my print statements and also update the endpoints so that the gateway is the same.

As far as running just 1 TCP server, I’ll have to report back.
The time it takes for the TCP server to go down is variable (could be 30 sec, 30 min, or 30 hours, etc.), so I’m not sure if 1 TCP server will stay up indefinitely.

@tony-josi-aws

I first removed the printf statements to see if that would make a difference.
I did so by making vLoggingPrintf do nothing:

void vLoggingPrintf2( const char * pcFormat, ... )
{
    return;
}

Unfortunately that did not help the situation.

I then made the gateway address the same for both endpoints:

static const uint8_t ucGatewayAddress1[ 4 ] = { 10, 100, 100, 1 };
static const uint8_t ucGatewayAddress2[ 4 ] = { 10, 100, 100, 1 };

And that did not help the situation either. One of the TCP servers dropped out.

The second TCP server is still going. I’ll let it run overnight to see if it eventually drops out also, or if just having one server running is ok.

One more thing, I’m wondering if one of my other function stubs in main.c could be an issue?

EDIT #1:
I was doing some experimenting and discovered something kind of interesting.
What I’m trying to do is use prvConnectionListeningTask() to create two TCP servers:

TCP Server1: 10.100.100.162 port 5002
TCP Server2: 10.111.111.163 port 5003

I have two linux machines in my lab. Let’s say I use a dual IP address on Linux Machine 1 and create the two TCP servers on the TCP server addresses and ports above.
If on Linux machine #2 I have a TCP client on 10.100.100.151, it can only connect to 10.100.100.162:5002. If I try connecting using port 5003 (i.e. connecting to 10.100.100.162:5003), then I get a connection error.

However, it is different behavior when using FreeRTOS+TCP.
If the Zynq has both TCP servers created above (10.100.100.162:5002 and 10.111.111.163:5003), I can use a Linux machine as a TCP client to connect to 10.100.100.162 on port 5003 or port 5002. No connection error here!

This is not what I was expecting. Why did it not give me a connection error?

Could this give us some insight as to why one of the TCP servers goes out?

Edit #2
I decided to do some more experimentation.
I modified my application to only have one endpoint.
I then created two TCP server threads:
TCP Server 1: 10.100.100.162 Port 5002
TCP Server 2: 10.100.100.162 Port 5003

I ran a similar experiment as before: two clients connecting to those two TCP servers
Rpi151 (10.100.100.151) as TCP Client 1 connecting to 10.100.100.162 Port 5002
Rpi152 (10.100.100.152) as TCP Client 2 connecting to 10.100.100.162 Port 5003

So far both TCP Server threads have been running without any issues.
(80 min so far and counting with no issues).
So there is something different about having that second endpoint.

Edit #3
One more experiment. I went back to creating two endpoints. However, this time I only create two TCP servers on the first endpoint (10.100.100.162).
Been running the TCP servers and clients for 40min so far and things appear to be stable.

Edit #3a
This is an update to the experiment in edit #3.
I let the TCP servers run overnight.
After 4829 sec (approx 1hr 20min), one of the TCP servers disconnected its TCP client.
The other TCP server and client are still up and running at 42348 sec and counting (approx 11.8 hrs and counting).
Note that this is with making the vLoggingPrintf do nothing (just return).

Edit #3b
Ran experiment from edit #3 yet again.
One TCP Server disconnected its client after 750 sec (approx 12 min).
The second TCP Server still running.

Edit #3c
I’ve been experimenting again and found something else interesting.
So after creating the two endpoints I only created the two TCP servers on endpoint 1, or so I thought.
TCP Server 1: 10.100.100.162 Port 5002
TCP Server 2: 10.100.100.162 Port 5003

However, on the Raspberry Pi I was able to create a TCP client that could connect to:
TCP Server 1: 10.111.111.163 Port 5002
TCP Server 2: 10.111.111.163 Port 5003

Very odd that the TCP server binding is so loose, if that makes sense.
Is there something wrong with FreeRTOS_bind() when there are multiple endpoints?

@tony-josi-aws

Back to one of your original questions.
With both endpoints, and running two TCP servers, after a variable amount of time one of the TCP servers ends the connection with its client (or the endpoint appears to stop working for a bit).
The other TCP server that did not quit appears to keep going indefinitely.
I left it overnight and that surviving TCP server has been going for 14 hours so far with no issues.

So again, the issue occurs when having two endpoints AND multiple TCP servers.

@svgarcia

Thanks for sharing your findings.

I am attaching the python tcp client (had to change the extension to upload)
tcp_client.py.c (891 Bytes)

Your client implementation script looks good to me.

EDIT #1:
I can use a Linux machine as a TCP client to connect to 10.100.100.162 on port 5003 or port 5002. No connection error here!

FreeRTOS+TCP doesn’t bind ports per endpoint, meaning that all ports are shared system-wide. Within FreeRTOS+TCP, the IP address is ignored when calling bind. E.g., Port 80 can be bound only once, system-wide. This choice is based on the fact that, being an embedded TCP/IP stack, the focus is more on supporting resource-constrained devices.

Also note that endpoints in FreeRTOS+TCP cannot be directly compared to endpoints or similar constructs in Linux or Windows, which support bridging. FreeRTOS+TCP endpoints don’t support network bridging.
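
To make the difference concrete, here is a toy model of the two lookup strategies. It is only a sketch and not FreeRTOS+TCP source code: `ToyListener_t`, `TOY_IP`, `pxLookupPortOnly` and `pxLookupAddressAndPort` are hypothetical names made up for illustration. The port-only lookup mirrors the behaviour described above (the address passed to bind is ignored), while the address-and-port lookup mirrors what you saw on the Linux machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a listening socket: the address the application asked for
 * and the port it bound to. Hypothetical type, not a FreeRTOS+TCP structure. */
typedef struct
{
    uint32_t ulBoundAddress;
    uint16_t usBoundPort;
} ToyListener_t;

/* Pack four octets into a host-order 32-bit address (illustration only). */
#define TOY_IP( a, b, c, d )                                   \
    ( ( ( uint32_t ) ( a ) << 24 ) | ( ( uint32_t ) ( b ) << 16 ) | \
      ( ( uint32_t ) ( c ) << 8 ) | ( uint32_t ) ( d ) )

/* Port-only lookup: the destination address of the incoming connection is
 * ignored, so any local address reaches the listener on that port. */
const ToyListener_t * pxLookupPortOnly( const ToyListener_t * pxTable,
                                        size_t uxCount,
                                        uint32_t ulDestAddress,
                                        uint16_t usDestPort )
{
    size_t ux;

    ( void ) ulDestAddress; /* Deliberately unused in this strategy. */

    for( ux = 0U; ux < uxCount; ux++ )
    {
        if( pxTable[ ux ].usBoundPort == usDestPort )
        {
            return &pxTable[ ux ];
        }
    }

    return NULL;
}

/* Address-and-port lookup: both must match, so a connection to the
 * "wrong" address for a given port finds no listener. */
const ToyListener_t * pxLookupAddressAndPort( const ToyListener_t * pxTable,
                                              size_t uxCount,
                                              uint32_t ulDestAddress,
                                              uint16_t usDestPort )
{
    size_t ux;

    for( ux = 0U; ux < uxCount; ux++ )
    {
        if( ( pxTable[ ux ].usBoundPort == usDestPort ) &&
            ( pxTable[ ux ].ulBoundAddress == ulDestAddress ) )
        {
            return &pxTable[ ux ];
        }
    }

    return NULL;
}
```

With listeners registered for 10.100.100.162:5002 and 10.111.111.163:5003, a connection to 10.100.100.162:5003 finds the second listener under the port-only lookup and finds nothing under the address-and-port lookup, which matches the two behaviours reported in EDIT #1.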

Edit #2
I modified my application to only have one endpoint.
I then created two TCP server threads:
TCP Server 1: 10.100.100.162 Port 5002
TCP Server 2: 10.100.100.162 Port 5003

I ran a similar experiment as before: two clients connecting to those two TCP servers
Rpi151 (10.100.100.151) as TCP Client 1 connecting to 10.100.100.162 Port 5002
Rpi152 (10.100.100.152) as TCP Client 2 connecting to 10.100.100.162 Port 5003

So far both TCP Server threads have been running without any issues.

This suggests that the application code with multiple TCP servers is working fine. I’m wondering if your network infrastructure expects FreeRTOS+TCP to bridge packets between the 2 endpoints? If that is the case, it would explain why your server-client connection is dropped after a while.

Edit #3
Is there something wrong with FreeRTOS_bind() when there are multiple endpoints?

This is explained by the similar case in EDIT #1.

@tony-josi-aws

Thanks for your reply.
I ended up doing two more experiments and discovered some interesting findings.

Experiment #1:
My network configuration is still the Rpi151, Rpi152, and the Zynq.
All devices have dual IP addresses:
Rpi151: 10.100.100.151 and 10.111.111.151
Rpi152: 10.100.100.152 and 10.111.111.152
Zynq: 10.100.100.162 and 10.111.111.163

The difference for experiment #1 is that I replaced the ethernet switch with a router.
Changing to a router did not make a difference. One of the TCP servers ends the connection after a certain amount of time.

Experiment #2
I went back to using a switch to connect all the devices together.
However, this time I gave each Rpi only 1 IP address, so that only the Zynq has a dual IP address.

So configuration was
Rpi151: 10.100.100.151 only
Rpi152: 10.111.111.152 only
Zynq: 10.100.100.162 and 10.111.111.163
So far both TCP server/client pairs have been working flawlessly for over 5 hours.

I’m starting to believe that maybe the Rpi’s having a dual IP address was confusing FreeRTOS+TCP. So maybe any device that wants to talk with an embedded platform running FreeRTOS+TCP should have only 1 IP address. Otherwise FreeRTOS+TCP will get confused.
What are your thoughts on this? Is this due to FreeRTOS+TCP not supporting network bridging?
Could you elaborate a little more on what you mean by network bridging, and on how FreeRTOS+TCP not supporting it limits my configuration?
I read up on network bridges, but the only thing I found online described a network bridge as a separate device that connects 2 LANs.

Edit#1
I wonder if the router made any difference or if the router just acted like my switch.
The router I used has 4 ethernet ports on the back, to which I connected the Rpi(s) and the Zynq.

I also wonder if I’m making the right conclusion from Experiment #2.
Before I said that the Zynq got confused when communicating to a Rpi with dual IP.
However, maybe b/c the Zynq is different, perhaps it is the switch which gets confused when dealing with an embedded device? I wonder if using a hub would make a difference?

I pointed out very early in this thread that multiple IP addresses mapped to the same MAC address are inconsistent with many routing/switching architectures. Might have saved you a lot of time if you had considered that piece of advice.

Again, I would recommend taking Wireshark traces on both ends (i.e. the RPi as well as your target) and then comparing the packet flow. I wouldn’t be the least surprised if you discovered that the packets got lost/dropped not in the sentinel devices but in the switching/routing infrastructure in between.

@RAc
I think using an Ethernet Hub will resolve that question of whether it was the Switch that was getting confused or whether it was FreeRTOS+TCP that was getting confused.

In a few days I will try with a hub and report back.

If you are able to get hold of a “dumb hub,” please let me know. I will make you an offer for it that you cannot resist. They are great for network monitoring, but you cannot get any of these anymore. Even the cheapest radio shack home use boxes have some degree of routing/switching in them.

@RAc
Look up “Netgear EN104TP” on ebay. I just bought one a few min ago and there appear to be plenty of sellers on there. That’s why I said I need a few days to test b/c I’m waiting for my hub to come in the mail.

@tony-josi-aws
After a long wait, I was finally able to get a “dumb” Ethernet hub connecting my network, which allowed me to get Wireshark network captures.

(1) Dual IP address on the Rpi still confuses the Zynq and eventually one of the TCP servers ends the connection.
(2) If the Rpis instead have only a single IP address each, then the TCP servers / clients keep talking indefinitely.

As I mentioned previously this is the same behavior I was seeing with both the Ethernet switch and the Ethernet router. The Ethernet hub made no difference. Hence I wonder if there is something that FreeRTOS+TCP doesn’t like when it talks to a peer that has a dual IP address as well.

I have various PCAP files that show the network traffic and how FreeRTOS+TCP is the one that all of a sudden ends the TCP Server connection. (See pcap_files zip file below).

NOTE1: In all trials below I had two TCP servers running on the Zynq and one TCP client running on each Rpi (Rpi_151 and Rpi_152)
Rpi_151: 10.100.100.151 and 10.111.111.151
Rpi_152: 10.100.100.152 and 10.111.111.152
Zynq: 10.100.100.162 and 10.111.111.163

NOTE2: Rpi_151 has TCP client on 10.100.100.151 communicating to Zynq TCP server on 10.100.100.162

NOTE3: Rpi_152 has TCP client on 10.111.111.152 communicating to Zynq TCP server on 10.111.111.163

Name: trial_1.pcapng
Description: Rpi_151 was able to send 662 messages. When it sent message 663 the Zynq sent a rst message. Rpi_152 kept communicating with Zynq.

Name: trial_2.pcapng
Description: Similar to trial 1 above. Rpi_151 was able to send 15 messages. When it sent message 16 the Zynq sent a rst message. Rpi_152 kept communicating with Zynq.
Also of note is some retransmissions in the pcap file

Name: trial_3.pcapng
Description: Similar to trials above except this time Rpi_151 kept going but Rpi_152 got rst message from Zynq.

Attached are the pcap files in a zip file
pcap_files.zip (5.1 KB)

@svgarcia

Thanks for sharing the PCAP files. Do you also have any Zynq device logs on what’s triggering the RST?

Also, there is a new community PR, “Endpoint mismatch in FreeRTOS_MatchingEndpoint” by ravitd (Pull Request #1239 in FreeRTOS/FreeRTOS-Plus-TCP on GitHub), that’s currently under review, which updates FreeRTOS_MatchingEndpoint when handling incoming packets. Maybe you can take a look at the PR and see if the behavior is different with those changes in your setup.

@tony-josi-aws

I don’t have any device logs. Is there anywhere in the FreeRTOS+TCP code base you’d recommend adding a log / print statement?

I’ll try out the PR and report back if I see any changes.

You can define these macros: ipconfigHAS_DEBUG_PRINTF and ipconfigHAS_PRINTF in FreeRTOSIPConfig.h

Example: FreeRTOS/FreeRTOS-Plus/Demo/FreeRTOS_Plus_TCP_IPv6_Demo/IPv6_Multi_WinSim_demo/FreeRTOSIPConfig.h (main branch of the FreeRTOS/FreeRTOS repository on GitHub)
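
As a rough illustration of what those definitions tend to look like, here is a minimal sketch. The mapping of `FreeRTOS_debug_printf` and `FreeRTOS_printf` onto `vLoggingPrintf` follows the pattern used in the demo configuration files, but treat the exact form as an assumption and check it against the file linked above. The `vLoggingPrintf` body and the `lToyLogCalls` counter are stand-ins invented for this sketch so that it compiles on its own; on the target you would keep your existing logging function.

```c
#include <stdarg.h>
#include <stdio.h>

/* Stand-in logger for this sketch only. lToyLogCalls is a hypothetical
 * counter used to show that the macros really reach the logger. */
int lToyLogCalls = 0;

void vLoggingPrintf( const char * pcFormat, ... )
{
    va_list xArgs;

    va_start( xArgs, pcFormat );
    ( void ) vprintf( pcFormat, xArgs );
    va_end( xArgs );

    lToyLogCalls++;
}

/* The part that would live in FreeRTOSIPConfig.h. */
#define ipconfigHAS_DEBUG_PRINTF    1
#if ( ipconfigHAS_DEBUG_PRINTF == 1 )
    #define FreeRTOS_debug_printf( X )    vLoggingPrintf X
#endif

#define ipconfigHAS_PRINTF    1
#if ( ipconfigHAS_PRINTF == 1 )
    #define FreeRTOS_printf( X )    vLoggingPrintf X
#endif
```

Note the double parentheses at the call site, for example `FreeRTOS_printf( ( "value %d\n", 1 ) );`. The inner pair becomes the argument list of `vLoggingPrintf`.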

I’ll try out the PR and report back if I see any changes.

Did you observe any difference?

@tony-josi-aws

I did try out the PR changes you pointed me to and I did notice something interesting. It fixed an issue I wasn’t aware of (which I describe below). It didn’t eliminate the original issue I have been reporting, however.

Let me give some context to describe the new issue that the PR fixed.

I ended up making my Rpis have only one IP address per Ethernet interface. So in order to get a dual IP address on one of the Rpis I added a USB to Ethernet dongle adapter.

So here’s my updated setup:
Rpi_151
Eth0: 10.100.100.151 (TCP Client)
Eth1: 10.111.111.149 (TCP Server) (dongle)

Linux_152
Eth0: 10.100.100.152 (UDP Server)

Zynq Eth0: 10.100.100.162 and 10.111.111.163
(2) TCP servers,
(1) TCP client (which talks to 10.111.111.149)
(1) UDP client (which talks to 10.100.100.152)

Here’s the new issue I discovered (BEFORE the PR changes):
At startup I noticed that only one of the clients on the Zynq would send messages.
For example the Zynq Client (talking on the 10.111.111.x subnet) would send messages, but the Zynq Client (talking on 10.100.100.x subnet) would not be sending anything on the line (confirmed by looking at wireshark and by the Rpi Server not rcv’ing anything).

I would have to manually open up a terminal on 10.100.100.152 and send a ping to the Zynq and that would somehow jump start the Zynq to start sending messages on the 10.100.100.x subnet.

If I reversed the order of adding endpoints in FreeRTOS+TCP then I’d see the opposite. (i.e. The Zynq Client talking on 10.100.100.x subnet would send messages on startup but then I’d have to manually open up a terminal on 10.111.111.149 and send a ping to the Zynq to get the Zynq Client to start sending messages on the 10.111.111.x subnet).

Needless to say, this was a serious issue, because in my real application I can’t open up a terminal to jump start the Zynq.

However, after I applied the PR you pointed me to, this particular issue appeared to go away. Now when the Zynq starts up, both the Zynq UDP Client (talking to 10.100.100.152) and the Zynq TCP Client (talking to 10.111.111.149) start sending messages.

Regarding my original issue of the Zynq TCP Server ending the connection of a client:
I am still seeing this phenomenon even in my new setup and with the PR. In the last trial I did with the PR, the TCP Server on the Zynq ended the connection with its client after 2646 sec (approx 44 min).
Again this is an issue I only see when using multiple endpoints on FreeRTOS+TCP. If I only have one endpoint and I change my setup so that all interfaces use only one subnet (i.e. 10.100.100.x) I don’t see any issues.

If I enable the ipconfigHAS_DEBUG_PRINTF and ipconfigHAS_PRINTF, this introduces another issue which I think I’ll open up a new forum thread for.
Basically at startup the console starts spitting out this:
“emacps_check_rx: unable to allocate a Network Buffer” over and over again and the whole system freezes up until I close the Zynq console monitoring terminal.
Basically this line in the FreeRTOS+TCP code gets triggered:

Perhaps it may be that I need to implement your advice and implement the printfdebug function with the queue instead of regular printf. But again, this debug printf issue should probably be a separate thread.
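
For reference, this is roughly the shape of the queued logger I have in mind. It is only a sketch under several assumptions: `vQueuedLoggingPrintf`, `xLoggingDequeue` and `uxLoggingDropped` are hypothetical names, the ring buffer is a plain array, and nothing here is protected against concurrent access. On the target the bookkeeping would need a critical section or a FreeRTOS queue, and a single low priority task would drain the buffer and do the actual printing.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define LOG_SLOTS       8U
#define LOG_LINE_LEN    96U

static char cLogLines[ LOG_SLOTS ][ LOG_LINE_LEN ];
static size_t uxHead = 0U;    /* Next slot to write. */
static size_t uxTail = 0U;    /* Next slot to read. */
static size_t uxCount = 0U;   /* Lines currently queued. */
static size_t uxDropped = 0U; /* Lines discarded because the buffer was full. */

/* Producer side: format into the next free slot and return immediately.
 * It never blocks and never touches the console; when the buffer is full
 * the line is dropped and counted instead. */
void vQueuedLoggingPrintf( const char * pcFormat, ... )
{
    va_list xArgs;

    if( uxCount == LOG_SLOTS )
    {
        uxDropped++;
        return;
    }

    va_start( xArgs, pcFormat );
    ( void ) vsnprintf( cLogLines[ uxHead ], LOG_LINE_LEN, pcFormat, xArgs );
    va_end( xArgs );

    uxHead = ( uxHead + 1U ) % LOG_SLOTS;
    uxCount++;
}

/* Consumer side: copy the oldest queued line into pcOut.
 * Returns 1 when a line was copied and 0 when the queue was empty. */
int xLoggingDequeue( char * pcOut, size_t uxOutLen )
{
    if( ( uxCount == 0U ) || ( pcOut == NULL ) || ( uxOutLen == 0U ) )
    {
        return 0;
    }

    ( void ) snprintf( pcOut, uxOutLen, "%s", cLogLines[ uxTail ] );
    uxTail = ( uxTail + 1U ) % LOG_SLOTS;
    uxCount--;

    return 1;
}

/* Number of lines lost so far; worth printing occasionally. */
size_t uxLoggingDropped( void )
{
    return uxDropped;
}
```

The idea is that the network driver and IP task only pay for a `vsnprintf` into memory, so a slow console can no longer stall them; whether that also cures the network buffer exhaustion is something I would still have to test.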

I did look at your source code again, and there may be potential for a race condition (albeit a small and rather unlikely one) where your XConnectedSocket in the receiver loop may get recycled by the stack at the wrong moment. Have you logged the socket identifiers and ensured that they are always unique and valid across the lifetime of each worker thread?

I understand that that control flow is reused from existing sample code, but no code has ever been tested and proven error free under all fringe conditions.

Also, I assume that you have removed error checking in your server just for readability, but make sure that failures (e.g. to create a worker thread) are handled appropriately in your software.

edit: The reason why I write this is the error return “not connected” that you receive. It hints at the possibility that an active connection attempts to communicate over a socket that has been closed asynchronously in another thread.
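
One way to check for this is a small tracking table that records which socket handles are currently live and flags a handle that is opened twice or used after it was closed. The following is only a sketch of the idea: `xTrackOpen`, `xTrackUse`, `xTrackClose` and `TrackResult_t` are hypothetical helpers, not part of FreeRTOS+TCP, and the table itself is not thread safe, so on the target the calls would need to be wrapped in a critical section or mutex.

```c
#include <stddef.h>

#define TRACK_SLOTS    8U

typedef enum
{
    eTrackOk,        /* Handle state is as expected. */
    eTrackDuplicate, /* Handle opened while it is already live. */
    eTrackNotLive,   /* Handle used or closed while not live (or NULL). */
    eTrackFull       /* No free slot left in the table. */
} TrackResult_t;

/* NULL marks a free slot. */
static const void * pxLiveHandles[ TRACK_SLOTS ];

/* Returns the slot index holding pxHandle, or TRACK_SLOTS when absent. */
static size_t uxFind( const void * pxHandle )
{
    size_t ux;

    for( ux = 0U; ux < TRACK_SLOTS; ux++ )
    {
        if( pxLiveHandles[ ux ] == pxHandle )
        {
            break;
        }
    }

    return ux;
}

/* Call right after a new connected socket is handed to a worker. */
TrackResult_t xTrackOpen( const void * pxHandle )
{
    size_t uxFree;

    if( pxHandle == NULL )
    {
        return eTrackNotLive;
    }

    if( uxFind( pxHandle ) < TRACK_SLOTS )
    {
        return eTrackDuplicate;
    }

    uxFree = uxFind( NULL );

    if( uxFree == TRACK_SLOTS )
    {
        return eTrackFull;
    }

    pxLiveHandles[ uxFree ] = pxHandle;
    return eTrackOk;
}

/* Call before every send or receive on the handle. */
TrackResult_t xTrackUse( const void * pxHandle )
{
    if( ( pxHandle == NULL ) || ( uxFind( pxHandle ) == TRACK_SLOTS ) )
    {
        return eTrackNotLive;
    }

    return eTrackOk;
}

/* Call right before the socket is closed. */
TrackResult_t xTrackClose( const void * pxHandle )
{
    size_t uxSlot;

    if( pxHandle == NULL )
    {
        return eTrackNotLive;
    }

    uxSlot = uxFind( pxHandle );

    if( uxSlot == TRACK_SLOTS )
    {
        return eTrackNotLive;
    }

    pxLiveHandles[ uxSlot ] = NULL;
    return eTrackOk;
}
```

Logging every result other than eTrackOk, together with the handle value and the task name, should show fairly quickly whether a worker ever touches a socket that another thread has already closed.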