AWS MQTT connection dies after some point

system · December 24, 2019, 6:04pm

dmh24 wrote on December 24, 2019:

I’m experiencing an issue with an MQTT connection that works fine up until a certain point.
Once i transfer roughly ~64k of data, the MQTT connection dies with an error like

(MQTT connection 0x8001faa8) Failed to receive PINGRESP within 15000 ms.

As you can see i have increased ping response time greatly, but the value does not matter, it always happens as soon as roughly that amount of data is transferred, which is easiest replicated by kicking off an OTA. I have also reproduced it by subscribing to a topic and sending ascii payloads.

I’m stuck on the 2018.08 release, but I’m curious if this rings any bells to anyone regarding configuration settings I could tweak? This is a custom PIC32MZEF board, and default configs (like IOT_THREAD_DEFAULT_STACK_SIZE), the killer is that the network stack (lwip) is still working, if i open a socket and request data or reconnect MQTT it works fine. I’d like to know the root cause though. Appreciate any insights

system · December 26, 2019, 3:51pm

DanG-AWS wrote on December 26, 2019:

Hello, can you take a network capture and analyze the TCP behavior, in particular, at the time of the timeout? That will obviously be somewhat easier if you can use the Ethernet port on your board.

aggarg · December 26, 2019, 8:21pm

Gaurav-Aggarwal-AWS wrote on December 26, 2019:

In addition, disconnectReason in the life-cycle events can also provide some insight into the reason of disconnect: https://docs.aws.amazon.com/iot/latest/developerguide/life-cycle-events.html

Thanks.

system · December 27, 2019, 1:08pm

dmh24 wrote on December 27, 2019:

Thanks for the ideas - ethernet bringup is not quite done on this board, but I should be able to back port back to the curiosity devboard, then capture ethernet. I’ll try that over the coming weeks.

In the meantime I tried the lifecycles and this is what the disconnected lifecycle has


{
  "clientId": "t2-10522",
  "timestamp": 1577451968069,
  "eventType": "disconnected",
  "clientInitiatedDisconnect": false,
  "sessionIdentifier": "09fbbb6d-cbad-4347-9f91-3505073523c6",
  "principalIdentifier": "d4dff533edd3423a2e9bcf502dd6f0bbc94c746b864395ea0c7d104183326d7a",
  "disconnectReason": "CONNECTION_LOST",
  "versionNumber": 0
}

not much useful but I’ll keep looking and try a reconnect strategy. The LWT goes off as well. Does the OTA agent handle reconnects without aborting the entire image download?

e: i forgot to mention, the primary connectivity is LTE-M using lwip PPP, so it very well may be legit but I’d like to avoid the reconnect cycle if possible.

Edited by: dmh24 on Dec 27, 2019 5:11 AM

Edited by: dmh24 on Dec 27, 2019 5:13 AM

aggarg · December 27, 2019, 8:03pm

Gaurav-Aggarwal-AWS wrote on December 27, 2019:

The disconnect reason CONNECTION_LOST means that the disconnect was initiated by the client. It will be helpful to capture network traffic (if possible) to see why the client is initiating disconnect. Which TCP/IP stack are you using? Is it possible to enable verbose logging in that stack?

Thanks.

system · December 28, 2019, 3:15pm

dmh24 wrote on December 28, 2019:

I’m using LWIP stack, but after enabling debug iot logging, the reason for disconnect is no PINGRESP received in given timeout. After some more testing, it is only happening during OTAs, but this is getting strange: if I change the OTA blocksize to 4096 to blocks, the problem happens far less. However, changing the block size back to 1024 , it will miss PINGRESPs 100% of the time. If I comment out the PINGRESP checking code, 1024 blocks work 100% of the time too. I’ll still work on getting full network traffic logs to see if it was truly missed but it seems like there is something else going on.

aggarg · December 30, 2019, 9:29pm

Gaurav-Aggarwal-AWS wrote on December 30, 2019:

Reducing the block size results in more number of chunks and therefore more incoming packets. A possibility is that the because of more incoming packets the PINGRESP is delayed. Would you please try to increase the value of IOT_MQTT_RESPONSE_WAIT_MS?

Thanks.

system · December 31, 2019, 8:51pm

dmh24 wrote on December 31, 2019:

I just had a bit of a shower thought after a 1024 blocksize OTA finished once out of 10 times, since the issue is so fleeting i’m very suspect the critical regions around network in/out are not working correctly. I will review that, backport to a device with ethernet and capture logs for proper analysis.

Thanks everyone for helping me chip at this!

system · December 31, 2019, 7:44pm

dmh24 wrote on December 31, 2019:

I increased IOT_MQTT_RESPONSE_WAIT_MS as a very first attempt to improve things,
I’ve tried from 1000 to 5 * 60 * 1000 with no difference

aggarg · December 31, 2019, 9:00pm

Gaurav-Aggarwal-AWS wrote on December 31, 2019:

That’s a possibility and we need to do more analysis to understand if that is the cause. Please share the logs when you have and we will be more than happy to help.

Thanks.

aggarg · December 31, 2019, 9:07pm

Gaurav-Aggarwal-AWS wrote on December 31, 2019:

Also note that as per the MQTT protocol spec, the MQTT client should only send PINGREQs when the connection is idle but our current implementation keeps sending periodic PINGREQs even on an active connection. This is on our list and we are working to address this.

Thanks.

system · January 8, 2020, 3:12am

dmh24 wrote on January 08, 2020:

well, with no code changes I'm now getting a different error:


[ERROR][NET][175400] Error -27648 while sending data.
[ERROR][MQTT][175400] (MQTT connection 0x8001c490) Failed to send PINGREQ.

this happens while the application is completely idle and pretty quickly. It's coming deep from somewhere in mbedTLS context being invalidated. I'm still digging …

system · January 8, 2020, 4:46am

dmh24 wrote on January 08, 2020:

This seems to be an unrelated issue with MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY on bad requests, even when that does not happen, eventually ping acks are still missed so please ignore above. Still working on ethernet log

system · January 13, 2020, 4:47pm

SarenaAtAws wrote on January 13, 2020:

Hello,

I just wanted to follow up if you are still having issues and if were able to dig up the ethernet logs?

system · January 13, 2020, 11:27pm

dmh24 wrote on January 13, 2020:

Hi,

Yes, the issue is still occuring, though its tolerable with mqttconfigKEEP_ALIVE_INTERVAL_SECONDS at 1200, happening about once every ~20 hours.

I'm still working on getting my ethernet stack going for proper logs though

system · January 15, 2020, 6:05pm

egoltzman wrote on January 15, 2020:

Hi, I hope its OK I'm pushing in , I'm experience some kind of a similar behavior:
My code based is 201906 Major and I'm working on an STM IOT based target board.
I created a successful MQTT connection and I use it with the shadow API, my thing get some occasional delta updates and also send some reported updates but all this is done on a very low frequency, most of the time (~99%) the system is idle.
Now, after some time (10-20 minutes) of successful keep-alive pings (i'm running with full debug log) I suddenly get a:


10832 5073012 [iot_thread] [DEBUG][MQTT][5073012] (MQTT connection 0x100063b8) Checking for PINGRESP.
10833 5073021 [iot_thread] [ERROR][MQTT][5073021] (MQTT connection 0x100063b8) Failed to receive PINGRESP within 1000 ms.

My question is how can I have automatic reconnection, if at all? Is there any example or documentation on this? I did not found any. I'm referring to this [http://aws-iot-device-sdk-embedded-c-docs.s3-website-us-east-1.amazonaws.com/AutoReconnect.html]
Maybe I can use the disconnectCallback but I'm not sure what should I do in order to re connect while I have a running Shadow connection that was establish with the disconnected MQTT connection handler.
Thanks

system · January 15, 2020, 9:33pm

egoltzman wrote on January 15, 2020:

Another update\input
I enlarged IOT_MQTT_RESPONSE_WAIT_MS to 5000 and now the system worked for more then 3 hours and then I got this:


21090 11427989 [iot_thread] [DEBUG][MQTT][11427989] (MQTT connection 0x100063b8) Sending PINGREQ.
21091 11427998 [iot_thread] [ERROR][MQTT][11427998] (MQTT connection 0x100063b8) Failed to send PINGREQ.

I see no other way to handle this but to use some kind of reconnect mechanasim so my question from the previous message are still valid,
Thanks

wanggord · January 16, 2020, 7:24pm

gordon-AWS wrote on January 16, 2020:

Hello,

In both of the cases you listed (failed to receive PINGRESP and failed to send PINGREQ), you should receive the disconnect callback from the MQTT library. In this callback, you can detect the disconnect and re-establish the connection.

From the disconnect callback, you can directly call IotMqtt_Connect. However, Connect takes a long time and will block all background tasks while it's in progress. If this isn't compatible with your application, you can have the disconnect callback post to a queue (or other notification mechanism) and handle the reconnect elsewhere.

Note that you will need to restore any previous subscriptions.

Auto-reconnect is a commonly requested feature, and we are working to add it soon.

system · January 17, 2020, 10:08am

egol wrote on January 17, 2020:

Thank you for your reply.
I have two questions on that:

When you say +“Note that you will need to restore any previous subscriptions”+ how this is related to the shadow of my thing? I actually did not made an explicit subscription to the shadow delta, I called _setShadowCallbacks and I think that this implicitly is doing the subscription, isn’t it? Should I call this again after reconnecting to the MQTT?
I’m not sure I fully understand how does the C SDK and the AWS FreeRTOS (AFR) package is working together. I’m asking this here since I can see that the C SDK include an option for MQTT auto reconnect (aws_iot_mqtt_autoreconnect_set_status) but the MQTT API that is part of the AFR is stored under “standard” libraries in the source code tree while the shadow library, for example, is stored under c_sdk
Can you please elaborate?
Thanks

wanggord · January 17, 2020, 10:59pm

gordon-AWS wrote on January 17, 2020:

Hello,

Yes, you will need to call SetDeltaCallback again after reconnecting.
Apologies for the confusion. The C SDK you mentioned (which has the function aws_iot_mqtt_autoreconnect_set_status) is C SDK version 3. Version 3 does not have anything to do with Amazon FreeRTOS, it is a separate project that existed before Amazon FreeRTOS. We decided to combine these two products going forward, and therefore integrated Version 4 of the C SDK into Amazon FreeRTOS. Because of the differences between C SDK version 3 and Amazon FreeRTOS, the combined product (version 4) is not backwards compatible and is still a work in progress.

Hope this clarifies; please let me know if you have any further questions.