We are observing an interesting side effect after upgrading to FreeRTOS MQTT V2.3.1.
If IotMqtt_Disconnect() is called after connection->disconnected has already been set to TRUE, we hit an assert. The flag gets set when _IotMqtt_ProcessKeepAlive() calls _IotMqtt_CloseNetworkConnection() with reason = IOT_MQTT_KEEP_ALIVE_TIMEOUT. The failing assert is IotMqtt_Assert( pMqttConnection->references >= 0 ) inside _IotMqtt_DecrementConnectionReferences(); at that point pMqttConnection->references is -1.
Did all that make sense?
Additionally, since IotMqtt_Disconnect() is what frees the socket back to our TCP/IP layer under the hood (the TI CC3220 SOCKETS_* layer), we still need to call IotMqtt_Disconnect() at least once ourselves after a KEEP_ALIVE failure. So it seems something needs to be adjusted in the references logic?
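To make the sequence above easier to follow, here is a minimal, self-contained model of it. It is not the library's code: `ModelConnection_t`, `model_decrement`, `model_keep_alive_timeout` and `model_disconnect` are hypothetical names, and the idea that the keep-alive timeout path consumes the last reference is only an assumption used to reproduce the observed counts (0 before the disconnect, -1 after). The thread does not state the actual root cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the MQTT connection object (not the real struct). */
typedef struct
{
    int references;    /* Outstanding references held on the connection. */
    bool disconnected; /* Set once the network connection has been closed. */
} ModelConnection_t;

/* Decrements the count. Returns false in the situation where the real code
 * would fail IotMqtt_Assert( pMqttConnection->references >= 0 ). */
static bool model_decrement( ModelConnection_t * conn )
{
    assert( conn != NULL );
    conn->references--;
    return conn->references >= 0;
}

/* ASSUMPTION: the keep-alive timeout path marks the connection as
 * disconnected and releases the one remaining reference. */
static bool model_keep_alive_timeout( ModelConnection_t * conn )
{
    conn->disconnected = true;
    return model_decrement( conn );
}

/* Models an application call to disconnect that decrements unconditionally,
 * even though the connection is already flagged as disconnected. */
static bool model_disconnect( ModelConnection_t * conn )
{
    conn->disconnected = true;
    return model_decrement( conn );
}
```

Starting from a count of 1 (an assumed starting value), calling `model_keep_alive_timeout` and then `model_disconnect` leaves the count at -1, which is the state the assert catches.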
Hi @schoeler, thank you for providing the context of the assert failure. To further understand how the reference count is reaching zero, can you debug the reference count at multiple lines of the IotMqtt_Disconnect function definition, specifically at:
- Line 1480, to know the reference count before _IotMqtt_CloseNetworkConnection is called.
- Line 1483, after the _IotMqtt_CloseNetworkConnection function is called, to see whether the count was affected.
- Lines 1490 and 1499, to see whether the _mqttOperation_tryDestroy calls made any difference to the connection reference count. Line 1499 will represent the reference count just before the _IotMqtt_DecrementConnectionReferences function is called, where you are noticing the assert failure.
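One way to capture the count at each of the points listed above is a small trace helper that formats a label together with the current value. The sketch below is only an illustration: `TraceConnection_t` and `trace_references` are hypothetical names, and the struct is a stand-in with a single `references` field rather than the library's connection type.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in exposing only the field being traced. */
typedef struct
{
    int references;
} TraceConnection_t;

/* Formats "<label>: references = <n>" into out.
 * Returns the snprintf result (the length the full message needs). */
static int trace_references( char * out,
                             size_t outLen,
                             const char * label,
                             const TraceConnection_t * conn )
{
    return snprintf( out, outLen, "%s: references = %d", label, conn->references );
}
```

Calling the helper with labels such as "before close" and "after close" at the corresponding points and printing the resulting buffer gives one line per checkpoint that can be compared across runs.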
Line 1480 → references = 0
Line 1483 → references = 0
Line 1490 → references = 0
Line 1499 → references = still 0
Then, inside _IotMqtt_DecrementConnectionReferences on line 1502, the count is decremented to -1.
Hope this helps
Hi @schoeler, thank you for sharing the data on the reference count in the
I will try to reproduce the issue of the reference count decrementing to zero on keep-alive timeout. Can you share information on any operations you perform (like Publish/Subscribe) before you encounter the keep-alive timeout?
Any luck reproducing? We can reproduce this very easily by decreasing IOT_MQTT_RESPONSE_WAIT_MS to a value small enough that the wait normally fails (we set it to 50 and see the disconnect assert after the failure).
We are subscribed to the IoT Core Classic Shadow and perform several publishes after boot up, but the pub/sub activity is pretty idle at the time of the keepalive failure.
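For anyone trying the reproduction described above, the override would look roughly like this. Placing it in the project's iot_config.h is an assumption about the project layout; the value 50 is the one mentioned in this thread.

```c
/* Assumed location: the project's iot_config.h.
 * Shrinks the response wait so that it normally fails,
 * which led to the disconnect assert in our setup. */
#define IOT_MQTT_RESPONSE_WAIT_MS    ( 50 )
```

This setting is only meant for reproducing the failure and should be reverted afterwards.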
Hi @schoeler, sorry for the late response. We have been able to reproduce the issue and are investigating the root cause. We will provide an update when we have diagnosed the issue as well as provide a fix for it.
Hi @schoeler, we have identified the root-cause of the reference count issue, and have created a bug-fix change for it in this GitHub Pull Request.
Can you check whether the patch fixes the issue for you?
So far, looks good! I will let it run overnight and report back.
I also want to let you know that the MQTT V2.3.1 stack is on the path to deprecation, and we encourage use of the new coreMQTT library, if possible for your project. We will also soon be releasing a thread-safe extension of coreMQTT, the coreMQTT-Agent library, which will be the recommended library for MQTT operations on FreeRTOS.