Network hangs during MQTT test

iabra wrote on Tuesday, June 25, 2019:

Hello,

I’m running a test of the MQTT Client interface on a FreeRTOS/LWIP platform and IAR debugger on a device connected to the local LAN. LWIP version is 2.1.2.

The problem is that after 1, 2,…15 hours of test the network hangs (no ping, apparently no interrupts) and after breaking with the debugger the stack (Tasks List) results corrupted as in the picture.
Apart from the network all the other tasks work correctly. There was no exception or ASSERT apparently fired.

The test consists in publishing 2 values (SysTime and T_Ext) every 2 seconds on a public broker (cloudmqtt.com) and subscribing to 1 message (‘LED’). The settings of mqtt_opts are left default. The problem is more likely to happen when MQTT tracing is disabled, and expecially (but not only) upon receiving the message subscribed.

These are the last lines of output I got before network hangs:

_tcp_sent_cb: Calling QoS 0 publish complete callback
STACK WaterMark =729
Publish: ICON_65646464/SysTime=5565373
mqtt_publish: Publish with payload length 7 to topic “ICON_65646464/SysTime”
mqtt_output_send: tcp_sndbuf: 8192 bytes, ringbuf_linear_available: 11, get 245, put 21
Publish: ICON_65646464/T_Ext=23.4
mqtt_publish: Publish with payload length 4 to topic “ICON_65646464/T_Ext”

_tcp_sent_cb: Calling QoS 0 publish complete callback
MQTT: LED ON
STACK WaterMark =729
Publish: ICON_65646464/SysTime=5565929
mqtt_publish: Publish with payload length 7 to topic “ICON_65646464/SysTime”
mqtt_output_send: tcp_sndbuf: 8192 bytes, ringbuf_linear_available: 32, get 48, put 80
Publish: ICON_65646464/T_Ext=22.9
mqtt_publish: Publish with payload length 4 to topic “ICON_65646464/T_Ext”

iabra wrote on Tuesday, June 25, 2019:

PCB state when this problem happens

rtel wrote on Tuesday, June 25, 2019:

Hmm, there would seem to be multiple places that could potentially cause this type of corruption, but I suggest starting with the network driver. Which chip are you running on? Where did you get the lwIP and lwIP port from?

iabra wrote on Thursday, June 27, 2019:

Thanks Richard.

Checking the sources I found the Port belonging to an older SDK so I updated to SDK 2.6.0 (and LWIP 2.1.2). The network chip is internal at the K65 NXP micro.

The driver in use is

; @file: startup_MK65F18.s
; @purpose: CMSIS Cortex-M4 Core Device Startup File
; MK65F18
; @version: 3.0
; @date: 2015-3-25
; @build: b151210

After the upgrade I 've never seen the ‘Task Stack’ corrupted in the debugger as in the first image, but the problem reported in the second one still happens, at random intervals, sometimes leading to reset of the controller -without exception - sometimes just hanging the network.

i.e. in the function mqttoutputsend()

err = altcpwrite(tpcb, mqttringbufgetptr(rb), sendlen, TCPWRITEFLAGCOPY | (wrap ? TCPWRITEFLAGMORE : 0));

if (err == ERROK) {
mqttringbufadvancegetidx(rb, sendlen);
/ Flush /
altcpoutput(tpcb);
}*

altcp_output get stuck in this loop:

/ useg should point to last segment on unacked queue /
useg = pcb->unacked;
if (useg != NULL) {
for (; useg->next != NULL; useg = useg->next);
}

…due to corruption of the PCB as reported

If so, I also wonder why the previos call to altcp_write() can pass the check
(tcp_write_checks) without entering in

LWIPASSERT(“tcpwrite: no pbufs on queue => both queues empty”,
pcb->unacked == NULL && pcb->unsent == NULL);

If you have any ideas on where I could look or need some more info, let me know.

Bye

iabra wrote on Friday, June 28, 2019:

Maybe it is not the definitive solution and I’m asking you to recheck , but it seems to work better this way:

Note:
The segment pointed by **useg ** was used in comparations despite of a possible - and in my case verified - NULL value.


err_t tcp_output(struct tcp_pcb *pcb)
{
...
/* put segment on unacknowledged list if length > 0 */
if (TCP_TCPLEN(seg) > 0){
    seg->next = NULL;
    /* unacked list is empty? */
    if (pcb->unacked == NULL)     {
        pcb->unacked = seg;
        useg = seg;
        /* unacked list is not empty? */
    } 
    //else
    else if (useg)//**(-IABR Jun 29, 2019 useg CAN be NULL )**
    { 
     /* In the case of fast retransmit, the packet should not go to the tail
		* of the unacked queue, but rather somewhere before it. 
		* We need to check for this case. -STJ Jul 27, 2004 */
        if (TCP_SEQ_LT(lwip_ntohl(seg->tcphdr->seqno), lwip_ntohl(useg->tcphdr->seqno))) {
     /* add segment to before tail of unacked list, keeping the list sorted */
            struct tcp_seg **cur_seg = &(pcb->unacked);
            while (*cur_seg &&
            TCP_SEQ_LT(lwip_ntohl((*cur_seg)->tcphdr->seqno), lwip_ntohl(seg->tcphdr->seqno))) {
                cur_seg = &((*cur_seg)->next );
            }
            seg->next = (*cur_seg);
            (*cur_seg) = seg;
        } 
        else {
            /* add segment to tail of unacked list */
            useg->next = seg;
            useg = useg->next;
        }
    }
    /* do not queue empty segments on the unacked list */
} 
...
}

rtel wrote on Friday, June 28, 2019:

Sorry - I’m not too familiar with lwIP code these days (we have had our
own TCP/IP stack for some years now) so I’m not sure what I’m looking at
here.

mqtt_delete_request() method has no protection against queue loops. I didn’t figure out how loops are created during putting requests into the queue, but breaking such loops fix was trivial and eliminated the problem

/**
 * Delete request item
 * @param r Request item to delete
 */
static void
mqtt_delete_request(struct mqtt_request_t *r)
{
  if (r != NULL) {
      // Break potential loops
      if (r->next != r) {
          r->next = r;
      }
      else {
          r->next = NULL;
      }
  }
}