MQTT reconnect issue

For a long time, I have been experiencing a rare problem after the connection on the external 4g channel (MCU - Ethernet - Router-4g usb modem) was lost, communication with the broker was not restored. I couldn’t reproduce the problem at all. Today I managed to connect remotely to the device and use the add vTCPNetStat() function;

I saw this result:
TCP 46129 c0a865edip :62801 1/1 eESTABLISHED 5993 14359
TCP 25303 6dfe1f2eip : 1885 0/1 eCLOSE_WAIT 999999 0
the socket can hang indefinitely in this state and the MQTT cycle is stopped. Can you tell me how to deal with this problem? My code is based on the MqttAgent example code.

Have you set ipconfigTCP_HANG_PROTECTION: FreeRTOS-Plus-TCP Configuration - FreeRTOS™?

1 Like

Along with enabling ipconfigTCP_HANG_PROTECTION, please also set the FREERTOS_SO_RCVTIMEO option on your MQTT socket.

This ensures that FreeRTOS_recv() does not block indefinitely. Without a receive timeout, the MQTT task can remain stuck waiting for data even after the stack has already closed a hung socket, which prevents the reconnection logic from executing.

1 Like

Thank you for reply. I understand that the problem occurs on the 4g side of the carrier. I couldn’t reproduce it on the test bench in any way. My router has the ability to point filter packets by flags. Can you tell me how i can repeat a similar situation by dropping packets with certain flags?

@moninom @aggarg

I have these options installed, and the problem occurs when they are installed. Here is a piece of code, and as I understand it, there is a stop somewhere inside the MQTTAgent_CommandLoop(). Since the task is in a blocked state

 
        xMQTTStatus = MQTTAgent_CommandLoop( &xGlobalMqttAgentContext );
        if( xMQTTStatus == MQTTSuccess )
        {

            xNetworkResult = prvSocketDisconnect( &xNetworkContext );

        }

        else
        {
       

since the function does not exit with an error, the function xNetworkResult = prvSocketDisconnect( &xNetworkContext ) is not called, which would close the connection. and the socket hang in eCloseWait

TCP 4691 6dfe1f2eip : 1885 1/1 eCLOSE_WAIT 999999 1124

MQTT Agent task can be in the blocked state if there is no MQTT communication. It should periodically wake up to send keep alive messages. One possibility is that your network interface implementation is not returning error even after the connection is broken. If that is the case, the agent will detect a broken connection only after the keep alive timeout - coreMQTT/source/include/core_mqtt_serializer.h at main · FreeRTOS/coreMQTT · GitHub. Would you please try to reduce the keep alive value?

Thank you for reply.

the corresponding options are enabled in my code.

#define mqttKEEP_ALIVE_INTERVAL_SECONDS       ( 30U )

xConnectInfo.keepAliveSeconds = mqttKEEP_ALIVE_INTERVAL_SECONDS;

and accordingly, in the logs, I see the ping command executed every 30 seconds

[13:30:41:659] < [DEBUG] [MQTT] [MQTT_Ping:3087] MQTT PINGREQ packet size is 2.
[13:30:41:659] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:2
[13:30:41:668] < [DEBUG] [MQTT] [sendBuffer:905] sendBuffer: Bytes Sent=2, Bytes Remaining=0
[13:30:41:668] < [DEBUG] [MQTT] [MQTT_Ping:3136] Sent 2 bytes of PINGREQ packet.
[13:30:41:721] < [DEBUG] [MQTT] [Agent_GetCommand:115] Agent pool get space 1
[13:30:41:721] < [DEBUG] [MQTT] [Agent_GetCommand:120] mqtt agent command 2
[13:30:41:721] < [DEBUG] [MQAG] [processCommand:603] com agent command:1
[13:30:41:721] < [DEBUG] [MQAG] [processCommand:630] call conclude
[13:30:41:721] < [DEBUG] [MQTT] [Agent_ReleaseCommand:138] Agent pool release space 0
[13:30:41:721] < [DEBUG] [MQTT] [Agent_ReleaseCommand:143] Returned Command Context 6 to pool
[13:30:41:721] < [DEBUG] [MQTT] [remainingLengthEncodedSize:479] Encoded size for length 0 is 1 bytes.
[13:30:41:721] < [DEBUG] [MQTT] [handleIncomingAck:1644] Received packet of type d0.

when the code is running abnormally, I don’t see any topics in the MQTT and MQAG logs. Unfortunately, this error occurs very rarely, and there is no way to write a log constantly. However, I have the ability to connect through CLI and view certain statuses. For example, the function MQTT_CheckConnectStatus(&xGlobalMqttAgentContext.mqttContext); returns the value MQTTStatusConnected. Can you suggest a more accurate way to implement the idea function through CLI to determine the stop location of the MQTT LOOP? However, when the connection is restored on the device, the mqttloop function does not restore the connection and command processing, and only resetting the device solves the problem.

Are you able to use debugger?

We need to find out where the agent is stuck. Usually debugger is very helpful in such situations.

Unfortunately, no. These devices are located on a remote object, and the error may occur a couple of times a month. But I can sometimes connect via cli and run certain functions. as an option, I can add a variable to the mqtt loop and change it to see at what value the variable will stop

Not sure how useful would that be. How about adding logs at points which can guide us where the Agent task is stuck? I suspect it might be the network interface implementation.

After some time, I managed to capture the traffic when the problems occurred. This issue is related to packet loss due to poor connectivity. But this shouldn’t affect connection reestablishment. Why isn’t the connection being reset? The connection was lost at 4:56 PM. And the process stopped at 5:00 PM. At 5:22 PM, I reset the MCU. With this issue, the xMQTTStatus = MQTTAgent_CommandLoop( &xGlobalMqttAgentContext ) loop doesn’t exit, and accordingly, the socket doesn’t close. However, when querying the value of
MQTTStatus_t MQTT_CheckConnectStatus( MQTTContext_t * pContext ), the status returned is MQTTStatusNotConnected.

Are you able to get the callstack to see where this function is stuck?

Unfortunately, I can’t because this error occurs on a remote device to which I can’t connect a debugger. The router was able to capture packets, which helped me figure out the cause. I’ve never had similar issues with the test board in the lab. Is there a way to replicate this using a Mikrotik router and create rules to cut packets? Then I can look at it in the debugger.

I am not sure about router specific part but you can try unplugging the ethernet cable from the router to simulate connectivity loss.

I tried disconnecting the Ethernet cable on the dev board many times, and the connection was always restored successfully. Here’s an example where the connection wasn’t lost. I understand the problem lies in the number of packets PSH, ACK after which the stack stops waiting. The screenshot below shows that after several unsuccessful attempts, the ACK packet arrived and the loop didn’t stop.

I need to somehow simulate the loss of PSH,ACK packets from the broker so that the cycle stops. How can I then find out exactly callstack to see where this function mqttloop is stuck?? After all, this task will be in a blocked state, and breakpoints won’t be triggered in this function.

If your your debugger is FreeRTOS aware, you should be able to get callstack for all the tasks.

I managed to create a situation where this problem reoccurs. (Random blocking of Ask packets (90% loss).
Here’s the call log:

 [13:03:53:662] < [DEBUG] [MQTT] [calculatePublishPacketSize:628] PUBLISH packet remaining length=520 and packet size=523.
 [13:03:53:662] < [DEBUG] [MQTT] [remainingLengthEncodedSize:479] Encoded size for length 520 is 2 bytes.
 [13:03:53:673] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:5
 [13:03:53:673] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=5, Bytes Remaining=518
 [13:03:53:673] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:16
 [13:03:53:673] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=16, Bytes Remaining=502
 [13:03:53:673] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:502
 [13:03:53:689] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=502, Bytes Remaining=0
 [13:03:53:689] < [DEBUG] [MQAG] [processCommand:634] call conclude
 [13:03:53:689] < [DEBUG] [MQAG] [concludeCommand:852] call calbback from agent
 [13:03:53:689] < [DEBUG] [MQ_AG] [prvPublishCommandCallback:392] CallBack code 0
 [13:03:53:689] < [DEBUG] [MQ_AG] [prvWaitForCommandAcknowledgment:418] notification received 537104056, state: 1 
 [13:03:53:689] < [DEBUG] [MQ_AG] [publishTelemetry:1128] publish ACK  state: 1
 [13:03:53:689] < [DEBUG] [MQTT] [Agent_ReleaseCommand:138] Agent pool release space 0
 [13:03:53:689] < [DEBUG] [MQTT] [Agent_ReleaseCommand:143] Returned Command Context 3 to pool
 [13:03:54:213] < [DEBUG] [MQ_AG] [publishTelemetry:1134] sen telemetry to pub queue
 [13:03:55:692] < [DEBUG] [MQTT] [MQTT_Ping:3087] MQTT PINGREQ packet size is 2.
 [13:03:55:692] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:2
 [13:03:55:707] < [DEBUG] [MQTT] [sendBuffer:905] sendBuffer: Bytes Sent=2, Bytes Remaining=0
 [13:03:55:707] < [DEBUG] [MQTT] [MQTT_Ping:3136] Sent 2 bytes of PINGREQ packet.
 [13:04:00:713] < [ERROR] [MQTT] [receiveSingleIteration:1770] Handling of keep alive failed. Status=MQTTKeepAliveTimeout
 [13:04:00:713] < [ERROR] [MQTT] [receiveSingleIteration:1785] Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
 [13:04:00:713] < [ERROR] [MQAG] [MQTTAgent_CommandLoop:1109] MQTT operation failed with status MQTTKeepAliveTimeout
 [13:04:00:713] < [DEBUG] [MQ_AG] [prvMQTTAgentTask:180] MQTT com loop: 10
 [13:04:00:713] < [DEBUG] [MQTT] [MQTT_Disconnect:3227] MQTT DISCONNECT packet size is 2.
 [13:04:00:713] < [INFO] [MQTT] [MQTT_Disconnect:3251] Disconnected from the broker.
 [13:04:00:713] < [ERROR] [MQTT] [MQTT_Disconnect:3258] MQTT Connection Disconnected Successfully
 [13:04:00:713] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:2
 [13:04:00:713] < [DEBUG] [MQTT] [sendBuffer:905] sendBuffer: Bytes Sent=2, Bytes Remaining=0
 [13:04:00:742] < [DEBUG] [MQTT] [MQTT_Disconnect:3274] Sent 2 bytes of DISCONNECT packet.
 [13:04:00:742] < [INFO] [MQ_AG] [prvSocketDisconnect:426] Disconnecting TCP connection.
 [13:04:00:742] < [INFO] [MQ_AG] [prvSocketConnect:326] Creating a TCP connection to broker vodasmart.ru:1885.
 [13:04:13:931] < [DEBUG] [MQTT] [remainingLengthEncodedSize:479] Encoded size for length 46 is 1 bytes.
 [13:04:13:942] < [DEBUG] [MQTT] [MQTT_GetConnectPacketSize:1780] CONNECT packet remaining length=46 and packet size=48.
 [13:04:13:942] < [DEBUG] [MQTT] [MQTT_Connect:2787] CONNECT packet size is 48 and remaining length is 46.
 [13:04:13:942] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:12
 [13:04:13:942] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=12, Bytes Remaining=36
 [13:04:13:942] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:2
 [13:04:13:955] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=2, Bytes Remaining=34
 [13:04:13:955] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:8
 [13:04:13:955] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=8, Bytes Remaining=26
 [13:04:13:955] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:2
 [13:04:13:955] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=2, Bytes Remaining=24
 [13:04:13:955] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:7
 [13:04:13:971] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=7, Bytes Remaining=17
 [13:04:13:971] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:2
 [13:04:13:971] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=2, Bytes Remaining=15
 [13:04:13:971] < [DEBUG] [MQTT] [Plaintext_FreeRTOS_send:137] FTOS Sen state:15
 [13:04:13:971] < [DEBUG] [MQTT] [sendMessageVector:821] sendMessageVector: Bytes Sent=15, Bytes Remaining=0
 [13:04:23:971] < [ERROR] [MQTT] [receiveConnack:2465] CONNACK recv failed with status = MQTTNoDataAvailable.
 [13:04:23:971] < [ERROR] [MQTT] [MQTT_Connect:2860] MQTT connection failed with status = MQTTNoDataAvailable.
 [13:04:23:971] < [INFO] [MQ_AG] [prvMQTTConnect:558] Session present: 0
 [13:04:23:992] < [DEBUG] [MQ_AG] [prvMQTTAgentTask:211] Mqtt prvMQTTConnest state: 7
 [13:04:23:992] < [DEBUG] [MQ_AG] [prvMQTTAgentTask:175] Start mqtt agent command loop

As you can see, there is no response from the function FreeRTOS_send

    socketStatus = FreeRTOS_send( pNetworkContext->tcpSocket, pBuffer, bytesToSend, 0 );
    LogDebug(("FTOS Sen state:%d",socketStatus));

The error message occurs at this location because confirmation from the function receiveConnack(…..) does not come

static void prvMQTTAgentTask( void * pvParameters )
{
    BaseType_t xNetworkResult = pdFAIL;
    MQTTStatus_t xMQTTStatus = MQTTSuccess, xConnectStatus = MQTTSuccess;
    MQTTContext_t * pMqttContext = &( xGlobalMqttAgentContext.mqttContext );

    ( void ) pvParameters;
    mqtt_conn_state = 0;
    do
    {
        /* MQTTAgent_CommandLoop() is effectively the agent implementation.  It
         * will manage the MQTT protocol until such time that an error occurs,
         * which could be a disconnect.  If an error occurs the MQTT context on
         * which the error happened is returned so there can be an attempt to
         * clean up and reconnect however the application writer prefers. */
        broker_isConnected = TRUE;
        mqtt_conn_state  = 1;
        LogDebug(("Start mqtt agent command loop"));
        xMQTTStatus = MQTTAgent_CommandLoop( &xGlobalMqttAgentContext );
        mqtt_conn_state  = 2;
//        xGlobalMqttAgentContext.mqttContext.connectStatus = xMQTTStatus;
//        broker_isConnected = FALSE;
        LogDebug(("MQTT com loop: %d",xMQTTStatus));
        /* Success is returned for disconnect or termination. The socket should
         * be disconnected. */
        if( xMQTTStatus == MQTTSuccess )
        {
            /* MQTT Disconnect. Disconnect the socket. */
            mqtt_conn_state  = 3;
            xNetworkResult = prvSocketDisconnect( &xNetworkContext );
            mqtt_conn_state  = 4;
        }
        /* Error. */
        else
        {
            #if ( configCREATE_CODE_SIGNING_OTA_DEMO == 1 )
                {
                    vSuspendOTACodeSigningDemo();
                }
            #endif
            mqtt_conn_state  = 5;
            ( void ) MQTT_Disconnect( pMqttContext ); 
            /* Reconnect TCP. */
            xNetworkResult = prvSocketDisconnect( &xNetworkContext );
            mqtt_conn_state  = 6;
            configASSERT( xNetworkResult == pdPASS );
            xNetworkResult = prvSocketConnect( &xNetworkContext );
            mqtt_conn_state  = 7;
            configASSERT( xNetworkResult == pdPASS );
            pMqttContext->connectStatus = MQTTNotConnected;
            mqtt_conn_state  = 8;
            /* MQTT Connect with a persistent session. */
            xConnectStatus = prvMQTTConnect( false );
            LogDebug(("Mqtt prvMQTTConnest state: %d",xConnectStatus));
            mqtt_conn_state  = 9;
        //    configASSERT( xConnectStatus == MQTTSuccess );

            #if ( configCREATE_CODE_SIGNING_OTA_DEMO == 1 )
                {
                    if( xMQTTStatus == MQTTSuccess )
                    {
                        vResumeOTACodeSigningDemo();
                    }
                }
            #endif
        }
    } while( xMQTTStatus != MQTTSuccess );
}


We’re getting a value returned MQTTNoDataAvailable from a function prvMQTTConnect( false ). How should this situation be handled? Should I close the socket and try reconnecting?