Does not return from MQTT_AGENT_Connect

Hi

I am using Amazon FreeRTOS MQTT V2.1.0.
When I repeatedly unplug and replug the LAN port, MQTT_AGENT_Connect() occasionally does not return.
The sequence is as follows:

  1. MQTT_AGENT_Connect() is called.
  2. The callback function is executed with xMQTTEvent set to eMQTTAgentDisconnect.
  3. MQTT_AGENT_Connect() never returns.

Does anyone know about this phenomenon?

Thanks

Can you break the code in the debugger and see what it is doing when it appears stuck?

I added logging and checked the output, and it appears that a data abort exception is occurring. I am trying to confirm this with a debugger attached, but the problem is hard to reproduce.

Upon investigation, the data abort exception occurs in the vListInsert() function in list.c of the FreeRTOS kernel.
There is a comment in vListInsert() that lists what to check when the application crashes there:

3) Calling an API function from within a critical section or when
   the scheduler is suspended, or calling an API function that does
   not end in "FromISR" from an interrupt.
4) Using a queue or semaphore before it has been initialised or
   before the scheduler has been started (are interrupts firing
   before vTaskStartScheduler() has been called?)

What measures should I take on the application side when calling MQTT_AGENT_Connect() and FreeRTOS_connect()? Are there any specific points to be aware of?

The FreeRTOS kernel version is V10.4.3.

Did you define configASSERT and enable stack checking?
I guess something is corrupting some internal data structures, or maybe dereferencing a NULL pointer.
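
For reference, a minimal sketch of what those two settings usually look like in FreeRTOSConfig.h. The assert body shown here (disable interrupts and spin) is only one common choice and is an assumption about the project, not something taken from this thread; a project may prefer to log or reset instead.

```c
/* FreeRTOSConfig.h (fragment, illustrative only) */

/* Halt in place when a kernel assertion fails so the debugger shows the
 * failing call site. The body is an assumed, commonly used default. */
#define configASSERT( x ) \
    if( ( x ) == 0 ) { taskDISABLE_INTERRUPTS(); for( ;; ) {} }

/* Method 2 checks a known pattern at the end of each task stack on every
 * context switch. The application must then also provide
 * vApplicationStackOverflowHook(). */
#define configCHECK_FOR_STACK_OVERFLOW    2
```

Note that the stack check only runs when a task is switched out, so an overflow that corrupts memory and crashes before the next context switch can go unreported.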

configASSERT is defined, and stack checking is enabled as well.
Neither check triggers.
Would it be possible to improve this by changing the way the API is called?

Ok. Since you figured out the call stack of the crash, and it only happens when the cable is disconnected or reconnected, is it possible that, for example, a closed socket (handle) is being used?
Or could synchronization be missing between the PHY link down/up handling and the rest of the networking code? Is the networking/MQTT code multithreaded, and does a task miss the handling of link down/up?

Sorry, my explanation was insufficient.
The cable I unplug and replug is the one leading to the MQTT server within the LAN.
The device's own PHY link therefore stays up.

Ok. So your test should cause failed connects or connect timeouts to the server?
Or what is the problem your application should handle in a fail-safe way?

Yes, I am trying to connect to the server, and I am getting failed connections and timeouts.
I don't know whether that is related, but I got a data abort exception.
I examined the stack and found the following call chain:

FreeRTOS_connect() -> xEventGroupWaitBits() -> vTaskPlaceOnUnorderedEventList()
    -> prvAddCurrentTaskToDelayedList() -> vListInsert()

I traced it to vListInsert() and found the cautionary comment, so I asked whether I could work around this by changing the way the API is called (e.g., waiting a bit before reconnecting).

The comment lists possible root causes of crashes in list operations. Waiting a bit before retrying a connect won't solve this kind of problem.
Unfortunately I'm not a coreMQTT expert and can't really help any further to nail down the problem.
Maybe there are some logging features that give better insight into what's going on (as provided by the underlying FreeRTOS+TCP stack)…
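
To illustrate why the crash site is usually not the culprit, here is a simplified host-side model of a sorted insert into a circular doubly linked list with an end marker, in the style of vListInsert(). It is not the kernel code: node_t, list_t, list_insert_sorted() and list_is_consistent() are hypothetical names made up for this sketch. The insert trusts every next/prev pointer it follows, so a node that was overwritten earlier (by a stack overflow, a stale handle, and so on) only faults later, when some unrelated task happens to walk the list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    uint32_t value;        /* sort key (a wake time in the kernel) */
    struct node *next;
    struct node *prev;
} node_t;

typedef struct {
    size_t count;          /* number of real nodes, excluding the marker */
    node_t end;            /* end marker, always holds the maximum value */
} list_t;

void list_init(list_t *l)
{
    l->count = 0;
    l->end.value = UINT32_MAX;
    l->end.next = &l->end;
    l->end.prev = &l->end;
}

/* Insert in ascending order. Every pointer followed here is assumed valid;
 * a corrupted node makes this loop fault or spin forever. */
void list_insert_sorted(list_t *l, node_t *n)
{
    node_t *it;

    if (n->value == UINT32_MAX) {
        it = l->end.prev;  /* maximum key goes last, avoids an endless walk */
    } else {
        for (it = &l->end; it->next->value <= n->value; it = it->next) {
        }
    }
    n->next = it->next;
    n->next->prev = n;
    n->prev = it;
    it->next = n;
    l->count++;
}

/* Debug helper: walk at most count + 1 links and verify the back-pointers. */
bool list_is_consistent(const list_t *l)
{
    const node_t *it = &l->end;

    for (size_t i = 0; i <= l->count; i++) {
        if (it->next == NULL || it->next->prev != it) {
            return false;
        }
        it = it->next;
    }
    return it == &l->end;  /* must arrive back at the end marker */
}
```

A bounded consistency walk like list_is_consistent() is the kind of check that could be called from a debug build to catch the corruption closer to where it happens, rather than at the next insert.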

Thanks for the reply.
I understand about the vListInsert() comment.
It may be triggered by some other issue, but I have no idea what.

Can you try increasing the stack size of the task which is calling FreeRTOS_connect?

I have stack checking enabled, and from what I have measured about 30% of the stack is still free.
What is the reason for this suggestion? Is there any precedent for this?

The reason for the suggestion is that this looks like memory corruption, and stack overflow is one of the main causes of memory corruption. Is it possible that you are using a connection after it has been closed, as described in the forum thread "Issue Closing TCP Socket" (post #7 by kohiro)?
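
One way to make a use-after-close easier to spot is to invalidate the handle at the moment it is closed and to check it before every use. The sketch below is a host-compilable model only: sock_t, SOCK_INVALID, stub_closesocket() and the conn_* helpers are hypothetical stand-ins, and in a real project they would map to the socket type, invalid-socket constant and close call of the TCP stack in use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins so the sketch compiles without the TCP stack. */
typedef void *sock_t;
#define SOCK_INVALID ((sock_t) NULL)

int close_calls = 0;                 /* counts how often the stub is hit */

void stub_closesocket(sock_t s)
{
    (void) s;
    close_calls++;
}

typedef struct {
    sock_t sock;                     /* SOCK_INVALID once closed */
} conn_t;

/* Close at most once and forget the handle immediately afterwards. */
void conn_close(conn_t *c)
{
    if (c->sock != SOCK_INVALID) {
        stub_closesocket(c->sock);
        c->sock = SOCK_INVALID;
    }
}

/* Call before any send/recv/connect on the handle. */
bool conn_is_usable(const conn_t *c)
{
    return c->sock != SOCK_INVALID;
}
```

This only helps if a single task owns the conn_t. If another task keeps its own copy of the handle, or can call conn_close() concurrently, the accesses still need to be serialized, for example with a mutex.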

Thanks for the reply.
I understand: the suggestion is because stack overflow is one of the common causes of memory corruption.
We are continuing to investigate the possibility that the socket is accessed after it has been closed.
Do you know of any similar cases with Amazon FreeRTOS MQTT V2.1.0?