My device publishes update commands to IoT Core based on system state changes and also receives updates from the shadow when needed.
When the system is properly connected to the network, things work smoothly.
If I disconnect the Ethernet cable for testing, leave it disconnected until the keep-alive timer is triggered and the server side detects the MQTT session as disconnected, and then reconnect it, the system resumes the MQTT session successfully and things continue to work as they did before the disconnection.
The problem arises if the system tries to send a publish command to the MQTTAgent during the disconnection. In that case the command waits in the agent's command queue. When the system is connected again, the agent tries to send a subscribe command (packed) with all the subscriptions that were active before the disconnection. But when the agent retrieves (gets) the next message to send, expecting it to be the subscribe command, the update message that was queued BEFORE the subscribe command is popped and sent to the server instead. From then on the session is missing the previous subscriptions and the system stops receiving updates.
One idea would be to prevent placing commands into the agent queue when the connection is broken. Your code would need to check the state of the connection before attempting to place an MQTT command into the agent queue.
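As a rough illustration of that check, here is a minimal C sketch. The connection flag, `isMqttConnected` and `enqueuePublish` are hypothetical stand-ins so that the snippet compiles on its own; in a real project they would be replaced by the application's connection query and by whatever call hands a publish to the agent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real system state and agent queue. */
static bool mqttConnected = false;   /* mirrors the "MQTT connected" event bit */
static size_t queuedCommands = 0;    /* counts commands handed to the agent */

static bool isMqttConnected( void )
{
    return mqttConnected;
}

static bool enqueuePublish( const char * topic, const char * payload )
{
    ( void ) topic;
    ( void ) payload;
    queuedCommands++;
    return true;
}

typedef enum
{
    PUBLISH_QUEUED,
    PUBLISH_SKIPPED_OFFLINE,
    PUBLISH_FAILED
} PublishResult;

/* Only place a publish into the agent queue while the session is up. */
PublishResult publishIfConnected( const char * topic, const char * payload )
{
    assert( topic != NULL && payload != NULL );

    if( !isMqttConnected() )
    {
        /* Nothing is queued, so nothing can jump ahead of the
         * re-subscribe that follows a reconnect. */
        return PUBLISH_SKIPPED_OFFLINE;
    }

    return enqueuePublish( topic, payload ) ? PUBLISH_QUEUED : PUBLISH_FAILED;
}
```

Note that a check like this only narrows the window: the connection can still drop between the check and the enqueue, so it reduces the problem rather than removing it entirely.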
Yes, I tried this by calling xIsMqttAgentConnected() before publishing. But then I saw that after exiting MQTTAgent_CommandLoop because of the keep-alive timeout, the agent first calls mbedtls_transport_disconnect and only after that sets the EVT_MASK_MQTT_CONNECTED bit (i.e. xEventGroupClearBits( xSystemEvents, EVT_MASK_MQTT_CONNECTED )). In my case (STM32H5 with LWIP) this call to mbedtls_transport_disconnect does not return before the reconnection, so xIsMqttAgentConnected, which checks the EVT_MASK_MQTT_CONNECTED bit, returns true even though the MQTT session is not connected.
I tried to find out why mbedtls_transport_disconnect does not return, and I saw that it is stuck in the call to vStopSocketNotifyTask, waiting for the notify thread. That thread is itself stuck in the sock_select call, so it never gets to check the notification.
I made a change in the agent to set the EVT_MASK_MQTT_CONNECTED bit before the call to mbedtls_transport_disconnect, and then calling xIsMqttAgentConnected before the publish did the job, but I do not like the idea of changing the agent.
You probably mean that you cleared the EVT_MASK_MQTT_CONNECTED bit before calling TLS disconnect. I do not see a problem with that. Would you like to raise a PR for that?
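To make the ordering concrete, here is a small self-contained C sketch. It is not the agent's code: `systemEvents` is a plain integer standing in for the xSystemEvents event group, the bit value is made up, and `transportDisconnectStub` stands in for the possibly long-blocking mbedtls_transport_disconnect call. The stub simply records what a publisher would observe if it queried the connection state while the disconnect is still in progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVT_MASK_MQTT_CONNECTED    ( 1u << 0 ) /* bit position is hypothetical */

/* Plain integer standing in for the xSystemEvents event group. */
static uint32_t systemEvents = EVT_MASK_MQTT_CONNECTED;

/* What a publisher would see if it asked during the disconnect call. */
static bool connectedSeenDuringDisconnect = false;

static bool isMqttAgentConnected( void )
{
    return ( systemEvents & EVT_MASK_MQTT_CONNECTED ) != 0u;
}

/* Stand-in for the transport disconnect, which may block for a long time. */
static void transportDisconnectStub( void )
{
    connectedSeenDuringDisconnect = isMqttAgentConnected();
}

/* Clear the "connected" bit first, then tear down the transport. */
void handleSessionLoss( void )
{
    systemEvents &= ~EVT_MASK_MQTT_CONNECTED;
    transportDisconnectStub();
}
```

With this order, any task that checks the bit while the disconnect is blocked already sees the session as down.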
Yes @aggarg, you are right, it is clearing the bit, not setting it.
Anyhow, I'll raise a PR for that, but I'm curious to know why mbedtls_transport_disconnect fails to stop the SocketNotify thread that is stuck in sock_select.
I see that mbedtls_transport_disconnect tries to stop the TLS socket listening, but since vSocketNotifyThread calls sock_select without a timeout, that function never returns when the network is disconnected.
I tried adding a 1-second timeout to this sock_select, but I'm not sure it is a safe solution, in the sense that messages from the server might be lost in that case. What do you think?
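For reference, the timeout-and-recheck pattern can be sketched with plain POSIX `select`, assuming sock_select behaves like it (this is an assumption, and the function and type names below are made up for illustration). With BSD-style sockets a `select` timeout does not by itself discard anything: data that arrives stays in the socket's receive buffer and is reported by the next `select` call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

typedef enum
{
    WAIT_READABLE,
    WAIT_STOPPED,
    WAIT_ERROR
} WaitResult;

/* Wait until fd is readable, waking up every second to check a stop flag. */
WaitResult waitReadableOrStop( int fd, volatile bool * stopRequested )
{
    assert( stopRequested != NULL );

    while( !*stopRequested )
    {
        fd_set readSet;
        /* Re-initialise on every pass: select may modify the timeout. */
        struct timeval timeout = { .tv_sec = 1, .tv_usec = 0 };

        FD_ZERO( &readSet );
        FD_SET( fd, &readSet );

        int ready = select( fd + 1, &readSet, NULL, NULL, &timeout );

        if( ( ready < 0 ) && ( errno != EINTR ) )
        {
            return WAIT_ERROR;
        }

        if( ( ready > 0 ) && FD_ISSET( fd, &readSet ) )
        {
            return WAIT_READABLE;
        }

        /* Timeout or interrupted call: loop and re-check the stop flag.
         * Pending data is not consumed by select, so nothing is dropped. */
    }

    return WAIT_STOPPED;
}
```

The cost of this approach is that a stop request can take up to one timeout period to be noticed, and the thread wakes once per second even when idle.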