AWS MQTT connection dies after some point

dmh24 wrote on January 19, 2020:

This is curious to me because, after doing some research and comparing trees,
I'm basically also on the 201906 release, and after further testing I have realized the missing PINGREQ isn't related to data traffic at all; it just happens after some time, exactly as egolzman describes.

I assume your STM target is using lwIP for the network stack? Is it the same in your tree? Mine is nearly identical to the official release with a few small patches from around the 201908 release, but it makes me wonder if we're missing some later fix in MQTT or similar.

egol wrote on January 19, 2020:

I'm using a WiFi module with the TCP stack on the module itself, so I'm not using lwIP.
BTW, before I upgraded to 201906 I used a 1.43-based tree, and there I did not get those disconnections.

egol wrote on January 23, 2020:

Thank you for your answer.
Is there a simple way I can simulate an MQTT disconnection from the AWS side?

gordon-AWS wrote on January 23, 2020:

If the client sends some invalid data to the server (I suggest just writing an arbitrary byte to the socket), the server will disconnect the client. You can use this to cause a disconnect.
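
As a minimal sketch of what that could look like with the Secure Sockets API (the socket handle and function name here are hypothetical; in practice you need access to the socket that underlies the MQTT connection):

#include <stdint.h>
#include "iot_secure_sockets.h"   /* aws_secure_sockets.h on older releases */

/* Sketch only: send one byte that is not valid MQTT so the broker parses
 * invalid data and closes the connection. */
static void prvForceServerDisconnect( Socket_t xMqttSocket )
{
    const uint8_t ucGarbage = 0xFF;

    /* The broker should drop the connection shortly after receiving this. */
    ( void ) SOCKETS_Send( xMqttSocket, &ucGarbage, sizeof( ucGarbage ), 0 );
}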

egol wrote on January 23, 2020:

Thank you very much, Gordon, for following up on this issue with me; your help is great and I appreciate it very much.
I'm now struggling to find the right sequence of actions to take after I get a disconnect callback from my Shadow connection (actually from MQTT or from the WiFi).
At first I just tried to re-establish the MQTT connection and set the Shadow callbacks (as discussed above), but the system was not able to return to normal operation, i.e. an MQTT connection up and running with the Shadow being updated successfully.
I tried to debug this and added calls to AwsIotShadow_Init and AwsIotNetworkManager_EnableNetwork, but I still end up in a situation where the MQTT connection is not successfully established and the system hangs.
What I'm asking is: what is the correct and complete sequence of actions needed for a clean reconnection after a failure? Is there a list or guide somewhere?
Thanks

egol wrote on January 24, 2020:

Hello DanG, I followed this further as you requested and found that the call to SOCKETS_Connect (in iot_network_afr.c) returns -1.
When I traced this error I found that it originates from ES_WIFI_StartClientConnection, called from aws_secure_sockets.c under stm32l475_discovery/ports.
I can see that the connection to the AP succeeded, that I got an IP address, and that I can ping the AWS host.
I want to emphasize that this happens only, and consistently, on the second reconnect attempt.
My original question referred to the list of actions needed for a "clean start". I still think that is what is needed here, since it looks like some leftovers pollute the connection-establishment path. I could try to trace them, but there is always a chance that something is left behind and causes unexpected errors, so I would prefer a fresh start after a failure that causes a lost connection.
What do you think?
Thanks

DanG-AWS wrote on January 24, 2020:

Hello egol, since you are able to reproduce the issue, are you also able to catch the cause of IOT_MQTT_NETWORK_ERROR higher in the call stack? For example, it would be helpful to know if the root cause of the network re-connection failure is indicated by a code coming back from the Wi-Fi module.

leegtataws wrote on January 23, 2020:

Hi

Thanks for the details. We would need some more information about how you are calling IotMqtt_Connect() from the disconnect callback.

  1. Are you calling IotMqtt_Connect directly from the disconnect callback? If so, that would be the reason the application is hanging. The disconnect callback runs on the IP task, and calling IotMqtt_Connect from there blocks that task, so a connection cannot be established. Moving the cleanup and reconnect to a different application task solves that issue (a rough sketch of this pattern follows below).

Unfortunately we don't have any examples or documentation for reconnecting MQTT after a disconnect.
Please let us know if the suggestion works for you. If you are still facing issues reconnecting, we will try to come up with a step-by-step sequence for performing a reconnect. In the meantime, please let us know if you have any further questions.
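
That said, a rough sketch of moving the reconnect out of the callback could look like this (illustrative only, not from our demos; the task handle and the reconnect routine are placeholder names):

#include "FreeRTOS.h"
#include "task.h"
#include "iot_mqtt.h"

/* Hypothetical application task that owns the reconnect logic. */
static TaskHandle_t xReconnectTaskHandle = NULL;

/* Hypothetical routine that performs cleanup and calls IotMqtt_Connect again. */
extern void prvReconnect( void );

static void prvDisconnectCallback( void * pCallbackContext,
                                   IotMqttCallbackParam_t * pCallbackParam )
{
    ( void ) pCallbackContext;
    ( void ) pCallbackParam;

    /* Do NOT call IotMqtt_Connect() here - this callback runs on the IP task.
     * Just signal the application task. */
    xTaskNotifyGive( xReconnectTaskHandle );
}

static void prvReconnectTask( void * pvParameters )
{
    ( void ) pvParameters;

    for( ; ; )
    {
        /* Block until the disconnect callback signals us. */
        ulTaskNotifyTake( pdTRUE, portMAX_DELAY );
        prvReconnect();
    }
}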

Thanks

egol wrote on January 24, 2020:

Hello and thank you for getting back to me.
I'll try to describe my implementation:

  • My code base is 201906
  • My target is a custom board based on the STM IOT kit (B-L475E-IOT01A)
  • I started from the Shadow demo and changed the main task to fit my application needs.
  • The main task has a message queue that receives messages from other tasks in the application and acts on them.
  • The main task initializes communication with the Shadow before it starts the loop that handles the messages.

Now to your question of how I'm trying to handle the disconnect event (WiFi and/or MQTT):

  • I defined a disconnect handler when I called IotMqtt_Connect - networkInfo.disconnectCallback = disconnectCallbackInfo;
  • The disconnectCallback starts a timer:

void _disconnectCallback(void* pArgument,IotMqttCallbackParam_t* const pParam)
{
    (void)pArgument;
    IotLogError("MQTT Disconnect callback service.");
    xTimerStart(DisconnectRetry_Timer,0);
}

and when the timer expires, its callback sends a message to the main task about the disconnect event:


void DisconnectRetry_TimerCallback( TimerHandle_t xTimer )
{
    MainLogicQueueData_t msgToMainLogic;
    configASSERT( xTimer );
    msgToMainLogic.eMsgType = eIntTasksMsg_MQTT_Disconnected;
    sendMsgToMainLogic(&msgToMainLogic);
}

The main task, when it gets this message, does this:


AwsIotShadow_Init(0);
AwsIotNetworkManager_DisableNetwork(AWSIOT_NETWORK_TYPE_WIFI);
if( AwsIotNetworkManager_EnableNetwork( configENABLED_NETWORKS ) != configENABLED_NETWORKS )
{
    IotLogError( "Failed to initialize all the networks configured for the device." );
    xTimerStart(DisconnectRetry_Timer,0);
    continue;
}

status = _establishMqttConnection(pIdentifier,
                                  pNetworkServerInfo,
                                  pNetworkCredentialInfo,
                                  pNetworkInterface,
                                  &global_wb_gw_mqttConnection);

status = _setShadowCallbacks(&deltaSemaphore,
                             global_wb_gw_mqttConnection,
                             pIdentifier,
                             thingNameLength);

_clearShadowDocument(global_wb_gw_mqttConnection, pIdentifier, thingNameLength);

And then the shadow should be recreated and the system should continue to work normally.

What I see is this:

  • I disconnected the Ethernet connection to the AP to simulate a break
  • After around 30 seconds (this is the configured MQTT keep-alive timeout) I get these messages:

120 44968 [iot_thread] [ERROR][MQTT][44968] (MQTT connection 0x10006510) Failed to receive PINGRESP within 5000 ms.
121 44978 [iot_thread] [INFO ][MQTT][44978] (MQTT connection 0x10006510) Network connection closed.
122 44987 [iot_thread] [ERROR][WB_GW][44987] MQTT Disconnect callback service.

I then immediately reconnect the Ethernet and wait for my DisconnectRetry_Timer to expire (also 30 seconds). When it does, the message is sent to the main task and I see this in the log:


135 74995 [iot_thread] [INFO ][WB_GW][74995] Main logic receive message - eIntTasksMsg_MQTT_Disconnected

136 75004 [iot_thread] [INFO ][Shadow][75004] Shadow library successfully initialized.
137 76179 [iot_thread] [INFO ][WB_GW][76179] Shadow Thing Name is WB_bulk_test_RSA1 (length 17).
138 77980 [iot_thread] [INFO ][MQTT][77980] Establishing new MQTT connection.
139 77987 [iot_thread] [INFO ][MQTT][77987] Anonymous metrics (SDK language, SDK version) will be provided to AWS IoT. Recompile with AWS_IOT_MQTT_ENABLE_METRI
140 78003 [iot_thread] [INFO ][MQTT][78001] (MQTT connection 0x10006ac8, CONNECT operation 0x10006958) Waiting for operation completion.
141 78031 [Tmr Svc] Room 1: power - 1, SetPoint - 24, Fan - 3, AmbientTemp - 21
142 78038 [Tmr Svc] Room 2: power - 1, SetPoint - 23, Fan - 1, AmbientTemp - 21
143 78227 [iot_thread] [INFO ][MQTT][78226] (MQTT connection 0x10006ac8, CONNECT operation 0x10006958) Wait complete with result SUCCESS.
144 78239 [iot_thread] [INFO ][MQTT][78239] New MQTT connection 0x200111e8 established.
145 78247 [iot_thread] [INFO ][Shadow][78247] (WB_bulk_test_RSA1) Modifying Shadow DELTA callback.
146 78256 [iot_thread] [INFO ][Shadow][78256] (WB_bulk_test_RSA1) Adding new DELTA callback.
147 78265 [iot_thread] [INFO ][MQTT][78265] (MQTT connection 0x10006ac8) SUBSCRIBE operation scheduled.
148 78275 [iot_thread] [INFO ][MQTT][78275] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006be0) Waiting for operation completion.
149 78471 [iot_thread] [INFO ][MQTT][78471] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006be0) Wait complete with result SUCCESS.
150 78483 [iot_thread] [INFO ][Shadow][78483] (WB_bulk_test_RSA1) Shadow DELTA callback operation complete with result SUCCESS.
151 78495 [iot_thread] [INFO ][Shadow][78495] (WB_bulk_test_RSA1) Modifying Shadow UPDATED callback.
152 78504 [iot_thread] [INFO ][Shadow][78504] (WB_bulk_test_RSA1) Adding new UPDATED callback.
153 78513 [iot_thread] [INFO ][MQTT][78513] (MQTT connection 0x10006ac8) SUBSCRIBE operation scheduled.
154 78523 [iot_thread] [INFO ][MQTT][78523] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006be0) Waiting for operation completion.
155 78711 [iot_thread] [INFO ][MQTT][78711] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006be0) Wait complete with result SUCCESS.
156 78724 [iot_thread] [INFO ][Shadow][78724] (WB_bulk_test_RSA1) Shadow UPDATED callback operation complete with result SUCCESS.
157 78736 [iot_thread] [INFO ][MQTT][78736] (MQTT connection 0x10006ac8) SUBSCRIBE operation scheduled.
158 78746 [iot_thread] [INFO ][MQTT][78746] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Waiting for operation completion.
159 78941 [iot_thread] [INFO ][MQTT][78940] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Wait complete with result SUCCESS.
160 78955 [iot_thread] [INFO ][MQTT][78953] (MQTT connection 0x10006ac8) SUBSCRIBE operation scheduled.
161 78965 [iot_thread] [INFO ][MQTT][78965] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Waiting for operation completion.
162 79152 [iot_thread] [INFO ][MQTT][79151] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Wait complete with result SUCCESS.
163 79164 [iot_thread] [INFO ][MQTT][79164] (MQTT connection 0x10006ac8) MQTT PUBLISH operation queued.
164 79388 [iot_thread] [INFO ][Shadow][79388] Shadow DELETE of WB_bulk_test_RSA1 was ACCEPTED.
165 79397 [iot_thread] [INFO ][MQTT][79397] (MQTT connection 0x10006ac8) UNSUBSCRIBE operation scheduled.
166 79408 [iot_thread] [INFO ][MQTT][79408] (MQTT connection 0x10006ac8, UNSUBSCRIBE operation 0x10006de0) Waiting for operation completion.
167 79722 [iot_thread] [INFO ][MQTT][79721] (MQTT connection 0x10006ac8, UNSUBSCRIBE operation 0x10006de0) Wait complete with result SUCCESS.
168 79736 [iot_thread] [INFO ][MQTT][79736] (MQTT connection 0x10006ac8) UNSUBSCRIBE operation scheduled.
169 79746 [iot_thread] [INFO ][MQTT][79746] (MQTT connection 0x10006ac8, UNSUBSCRIBE operation 0x10006de0) Waiting for operation completion.
170 79926 [iot_thread] [INFO ][MQTT][79925] (MQTT connection 0x10006ac8, UNSUBSCRIBE operation 0x10006de0) Wait complete with result SUCCESS.
171 79938 [iot_thread] [INFO ][WB_GW][79938] Successfully cleared Shadow of WB_bulk_test_RSA1.
172 79948 [iot_thread] [INFO ][MQTT][79948] (MQTT connection 0x10006ac8) SUBSCRIBE operation scheduled.
173 79958 [iot_thread] [INFO ][MQTT][79958] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Waiting for operation completion.
174 80153 [iot_thread] [INFO ][MQTT][80152] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Wait complete with result SUCCESS.
175 80165 [iot_thread] [INFO ][MQTT][80165] (MQTT connection 0x10006ac8) SUBSCRIBE operation scheduled.
176 80177 [iot_thread] [INFO ][MQTT][80177] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Waiting for operation completion.
177 80356 [iot_thread] [INFO ][MQTT][80355] (MQTT connection 0x10006ac8, SUBSCRIBE operation 0x10006de0) Wait complete with result SUCCESS.
178 80368 [iot_thread] [INFO ][MQTT][80368] (MQTT connection 0x10006ac8) MQTT PUBLISH operation queued.

Up to here everything looks fine. Now I try this again, but this time, after the message is received in the main task, I get an error message from IotMqtt_Connect:


228 174527 [iot_thread] [ERROR][NET][174527] Failed to establish new connection.

I debugged this and found that it is caused in this section of IotMqtt_Connect:


    /* Create a new MQTT connection if requested. Otherwise, copy the existing
     * network connection. */
    if( pNetworkInfo->createNetworkConnection == true )
    {
        networkStatus = pNetworkInfo->pNetworkInterface->create( pNetworkInfo->u.setup.pNetworkServerInfo,
                                                                 pNetworkInfo->u.setup.pNetworkCredentialInfo,
                                                                 &pNetworkConnection );

        if( networkStatus == IOT_NETWORK_SUCCESS )
        {
            networkCreated = true;

            /* This MQTT connection owns the network connection it created and
             * should destroy it on cleanup. */
            ownNetworkConnection = true;
        }
        else
        {
            IOT_SET_AND_GOTO_CLEANUP( IOT_MQTT_NETWORK_ERROR );
        }
    }
    else

The error I get is IOT_MQTT_NETWORK_ERROR.
I hope this description is clear.
Thanks

egol wrote on January 24, 2020:

I looked and found that there is a function WIFI_Reset() in the WiFi API. I added it at the beginning of the reconnect sequence and it looks like the problem disappeared. This is the sequence I'm doing now:


WIFI_Reset();
AwsIotShadow_Init(0);
AwsIotNetworkManager_DisableNetwork(AWSIOT_NETWORK_TYPE_WIFI);
if( AwsIotNetworkManager_EnableNetwork( configENABLED_NETWORKS ) != configENABLED_NETWORKS )
{
    IotLogError( "Failed to initialize all the networks configured for the device." );
    xTimerStart(DisconnectRetry_Timer,0);
    continue;
}

status = _establishMqttConnection(pIdentifier,
                                  pNetworkServerInfo,
                                  pNetworkCredentialInfo,
                                  pNetworkInterface,
                                  &global_wb_gw_mqttConnection);

status = _setShadowCallbacks(&deltaSemaphore,
                             global_wb_gw_mqttConnection,
                             pIdentifier,
                             thingNameLength);

Let's consider this resolved.
Thanks!

gordon-AWS wrote on January 24, 2020:

Glad you figured it out!

Unrelated: are you calling AwsIotShadow_Cleanup somewhere before the call to AwsIotShadow_Init? You must call AwsIotShadow_Cleanup to avoid leaking memory if AwsIotShadow_Init was called before.
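
For example, a minimal sketch of pairing the two calls on each reconnect attempt (illustrative only, not taken from the demos):

/* Release resources from the previous initialization before re-initializing. */
AwsIotShadow_Cleanup();

if( AwsIotShadow_Init( 0 ) != AWS_IOT_SHADOW_SUCCESS )
{
    IotLogError( "Failed to re-initialize the Shadow library." );
}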

jnorgan wrote on February 01, 2020:

Just to share roughly what I am doing in my code in case someone else comes upon this forum post.

MQTT Reconnect Outline

void mqttTask(...) {
    IotMqtt_Init();
    AwsIotShadow_Init(0);
    _establishMqttConnection(...);
    _modifySubscriptions(...);
    _setShadowCallbacks(...);
    
    while(1) {
        if(reconnect_to_mqtt) {
            AwsIotShadow_Cleanup();
            AwsIotShadow_Init(0);
            _establishMqttConnection(...);
            _modifySubscriptions(...);
            _setShadowCallbacks(...);
            reconnect_to_mqtt = false;
        }
    }
}

void someKindOfWiFiReconnectCallback(...) {
    // This cleans up a bunch of MQTT resources and triggers the disconnect callback.
    IotMqtt_Disconnect(mqttConnection, 0);
}

void mqttDisconnectCallback(...){
    // Set a flag or post a message to trigger a reconnect in the task that originally set up the MQTT connection.
    reconnect_to_mqtt = true;
}

Hi there,
this is dmh24 coming back after a long, long time. I couldn't test with Ethernet because some board spins were required, but I was eventually able to narrow it down through a lot of painful LTE testing.

What happens is that Mbed TLS is getting an MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY alert.
The version I am on doesn't have Mbed TLS as a git submodule yet. Trying to compare them, the code treating the warning as an error did not even seem to be present, so as a start I am going to try replacing the whole Mbed TLS with the same version the newer SDK release uses.

The easiest way to find it was to enable Mbed TLS debugging (MBEDTLS_DEBUG_C) and use the lowest keep-alive possible, 30 seconds. It always happened on the 3rd or 4th ping.

Well, actually I was wrong: the code is basically the same. This ends up being treated as a fatal alert even though it is a warning-level alert, in ssl_tls.c line 5182 (the equivalent code is also in the master submodule):

if( ssl->in_msg[0] == MBEDTLS_SSL_ALERT_LEVEL_WARNING &&
    ssl->in_msg[1] == MBEDTLS_SSL_ALERT_MSG_CLOSE_NOTIFY )
{
    MBEDTLS_SSL_DEBUG_MSG( 2, ( "is a close notify message" ) );
    return( MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY );
}

I will see whether ignoring non-fatal alerts does the trick, but it's odd.
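
For reference, a sketch of what treating close_notify as an orderly shutdown (rather than a hard error) could look like in a receive wrapper. This is not the SDK's actual port code, and on its own it will not keep the connection open, since the peer has already closed the session:

#include "mbedtls/ssl.h"

/* Sketch only: distinguish a clean close_notify from a hard TLS failure. */
static int prvTlsRecv( mbedtls_ssl_context * pxSsl,
                       unsigned char * pucBuffer,
                       size_t xLength )
{
    int lResult = mbedtls_ssl_read( pxSsl, pucBuffer, xLength );

    if( lResult == MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY )
    {
        /* Peer closed the TLS session cleanly - report end-of-stream (0 bytes)
         * instead of a fatal error so the caller can tear down gracefully. */
        lResult = 0;
    }

    return lResult;
}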

Some more info right away: I can publish thousands of application messages without issue, but a PINGREQ triggers it right away, usually on the second or third, as mentioned.

About the PINGREQ that doesn't work: the keep-alive worker prepares a PINGREQ, sends it, and immediately gets a legitimate connection-closed alert back from AWS. There are no errors in CloudWatch though, just the disconnect message, which is confusing.

I'm now working on gathering and comparing the unencrypted and encrypted data sent over the wire for working versus non-working PINGREQs.

** EDIT: looked at the pings -

After some more testing I am starting to see what is going on, I think.
I enabled TCP keep-alive with a really small window; it's not related to that or to the underlying network stack.

Here's the key thing: I commented out the section that sends the PINGREQ over the network and made the PINGRESP check always pass, and I still got a connection-closed message from the server at the same time, so it's unrelated to the ping message itself.

This is with the MQTT keep-alive minimum of 30 seconds in the connection info. If I use 1200 (the maximum) and send no network traffic, it stays alive for a little over 1200 seconds, as you would expect. So it seems AWS takes the client setting and enforces it server side, closing the connection if it doesn't receive a PINGREQ within some window around it.

Now the key thing is that I had HUGE timeouts set on MQTT operations, 15 seconds or so, left over from bench testing, stepping through the cellular stack, etc. But the MQTT client worker code that handles the PINGREQ/PINGRESP timer waits out the full response timeout to see if it got a PINGRESP before starting the keep-alive timer again. So with a 30-second keep-alive and 15-second waits on MQTT operations, the ping interval appears to drift outside the window the server enforces…

It would be nice if AWS could confirm any functionality like that while I try more reasonable timeouts on a solid connection!

It appears I ran into this:

Thank you for getting back to us with your findings. The PR that you pointed to was a fix for an issue related to sending PINGREQ messages. As I understand it, you would have faced the same issue. Have you tried that change, and did it fix the issue for you?

We are trying to understand how we can help you with regard to the statement:
"It would be nice if AWS could confirm any functionality like that while I try more reasonable timeouts on a solid connection!"
Can you please help us understand the specific functionality you are referring to?

Sorry, I crammed a lot into that post.

The issue definitely appears to be the one the PR addressed. It was most obvious with a 30-second keep-alive and a 10-second response timeout; that combination would never run more than 2 minutes. I ended up implementing nearly the same fix before discovering it. Using that, I am able to run much longer (several hours on a 30-second keep-alive), but I am still getting sporadic disconnects, just far less frequently; still looking into that. Using the maximum of 1200 is probably acceptable for now.

My question for AWS is whether the keep-alive seconds parameter on the MQTT client is sent to the server, and whether the server then uses it to close the connection server side if it doesn't get a PINGREQ within that window. It definitely appears so.

The ping timeouts I am still occasionally seeing appear to be the same thing (a server-side disconnect arriving via Mbed TLS as level-1/description-0 alerts, i.e. warning/close_notify), so I'm pretty stumped.

Thanks for getting back to us with the specific question.

You are right about the PINGREQ behavior of the AWS IoT MQTT broker. Please find the documentation about disconnects due to keep-alive timeout at the link below:
https://docs.aws.amazon.com/iot/latest/developerguide/life-cycle-events.html

Quoting the keep-alive timeout disconnect behavior from the above link:
“If there is no client-server communication for 1.5x of the client’s keep-alive time, the client is disconnected.”

Hope this answered your question.
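
As a rough illustration of how that rule interacts with the timeouts discussed earlier in this thread (sketch only, not SDK code):

/* Values taken from this thread, for illustration only. */
#define KEEP_ALIVE_SECONDS         30U   /* keepAliveSeconds sent in CONNECT           */
#define MQTT_OPERATION_TIMEOUT_S   15U   /* time the keep-alive job waits for PINGRESP */

/* AWS IoT allows 1.5x the client keep-alive with no communication. */
#define SERVER_DISCONNECT_WINDOW_S ( ( KEEP_ALIVE_SECONDS * 3U ) / 2U )   /* 45 s */

/* If the keep-alive job waits out the full operation timeout before scheduling
 * the next PINGREQ, the gap between PINGREQs can approach
 * KEEP_ALIVE_SECONDS + MQTT_OPERATION_TIMEOUT_S = 45 s, i.e. right at (or past)
 * the 45 s window - consistent with the sporadic server-side disconnects here. */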

Also, about the sporadic disconnects you are still seeing: please let us know if the PINGREQ fix helped with the issue. If you are continuing to face it even after applying the above-mentioned fix, please get back to us.

Hi there, thanks for the link; that clarifies a lot. Back-porting the fix ended up completely fixing the issue.

PINGREQ disconnects were still coming in randomly after applying the fix, but those turned out to be caused by another, pretty severe issue that would end up corrupting any task. I got my fill of red herrings.