Is OTA agent task watching for network disconnects?

FreeOurToes wrote on December 13, 2019:

Hi,

This is related to my previous post
https://forums.aws.amazon.com/thread.jspa?threadID=313434&tstart=0

I haven’t implemented WiFi reconnects and was doing long term runs for testing WDT I had a wifi disconnect occur but OTA agent didn’t get affected it was still printing out messages from


while( ( eState = OTA_GetAgentState() ) != eOTA_AgentState_Stopped )
                {
                    /* Wait forever for OTA traffic but allow other tasks to run and output statistics only once per second. */
                    vTaskDelay( myappONE_SECOND_DELAY_IN_TICKS );
                    configPRINTF( ( "State: %s  Received: %u   Queued: %u   Processed: %u   Dropped: %u\r
", pcStateStr[ eState ],
                                    OTA_GetPacketsReceived(), OTA_GetPacketsQueued(), OTA_GetPacketsProcessed(), OTA_GetPacketsDropped() ) );
                }


Should OTA_GetAgentState() change if network connection is lost?

I have my own task running which is identical to OTA demo plus I have a queue listening from incoming sensor data and reusing mqtt handle if data is received. But else is identical.

This is my debugging output


9659 351428 [iot_thread] [INFO ][MQTT][3514280] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffff5d8) Waiting for operation completion.
9660 351435 [iot_thread] [INFO ][MQTT][3514350] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffff5d8) Wait complete with result SUCCESS.
9661 351435 [iot_thread] IoTMQTT Qos1 Result 0
9662 351436 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9663 351638 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9664 351644 [spi_ac_app] 17 0.008502 0.005917
9665 351644 [spi_ac_app] {"runtime":400,"raw_value":17,"voltage":0.008502}
9666 351839 [iot_thread] queue found {"runtime":400,"raw_value":17,"voltage":0.008502}
9667 351839 [iot_thread] Topic: /robco_local/ac_current/esp32_tests_gg/, Payload: {"runtime":400,"raw_value":17,"voltage":0.008502}.
9668 351839 [iot_thread] [INFO ][MQTT][3518390] (MQTT connection 0x3fff4044) MQTT PUBLISH operation queued.
9669 351839 [iot_thread] IoTMQTT Qos1 Result 1
9670 351840 [iot_thread] [INFO ][MQTT][3518390] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffff604) Waiting for operation completion.
9671 352044 [spi_ac_app] 17 0.008502 0.005917
9672 352044 [spi_ac_app] {"runtime":400,"raw_value":17,"voltage":0.008502}
E (3521852) NetInterface: Failed to tx buffer 0x3fffea7e, len 177, err -1
9673 352340 [iot_thread] [INFO ][MQTT][3523400] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffff604) Wait complete with result TIMEOUT.
9674 352340 [iot_thread] IoTMQTT Qos1 Result 8
9675 352341 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
I (3524602) wifi: bcn_timout,ap_probe_send_start
9676 352444 [spi_ac_app] 17 0.008502 0.005917
9677 352444 [spi_ac_app] {"runtime":400,"raw_value":17,"voltage":0.008502}
9678 352542 [iot_thread] queue found {"runtime":400,"raw_value":17,"voltage":0.008502}
9679 352542 [iot_thread] Topic: /robco_local/ac_current/esp32_tests_gg/, Payload: {"runtime":400,"raw_value":17,"voltage":0.008502}.
W (3525742) wifi: alloc eb len=24 type=3 fail, heap:320

W (3525742) wifi: m f null

E (3525872) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3525872) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3526742) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3527482) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3527482) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3527482) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3527492) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3528742) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3530692) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3530692) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3530692) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3530702) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
E (3530702) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1
9680 353042 [iot_thread] IoTMQTT Qos1 Result 8
9681 353043 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9682 353244 [iot_thread] queue found {"runtime":400,"raw_value":17,"voltage":0.008502}
9683 353244 [spi_ac_app] 19 0.009502 0.006613
9684 353244 [spi_ac_app] {"runtime":400,"raw_value":19,"voltage":0.009502}
9685 353245 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
I (3534122) wifi: bcn_timout,ap_probe_send_start
9686 353446 [iot_thread] queue found {"runtime":400,"raw_value":19,"voltage":0.009502}
9687 353447 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
I (3536622) wifi: ap_probe_send over, resett wifi status to disassoc
I (3536622) wifi: state: run -> init (c800)
I (3536622) wifi: pm stop, total sleep time: 3119358613 us / 3534531782 us

I (3536632) wifi: new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (3536632) WIFI: SYSTEM_EVENT_STA_DISCONNECTED: 200
9688 353644 [spi_ac_app] 17 0.008502 0.005917
9689 353644 [spi_ac_app] {"runtime":400,"raw_value":17,"voltage":0.008502}
9690 353648 [iot_thread] queue found {"runtime":400,"raw_value":19,"voltage":0.009502}
9691 353648 [iot_thread] Topic: /robco_local/ac_current/esp32_tests_gg/, Payload: {"runtime":400,"raw_value":19,"voltage":0.009502}.
9692 353648 [iot_thread] [INFO ][MQTT][3536480] (MQTT connection 0x3fff4044) MQTT PUBLISH operation queued.
9693 353648 [iot_thread] IoTMQTT Qos1 Result 1
9694 353648 [iot_thread] [INFO ][MQTT][3536480] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffffee8) Waiting for operation completion.
9695 354044 [spi_ac_app] 15 0.007502 0.005220
9696 354044 [spi_ac_app] {"runtime":400,"raw_value":15,"voltage":0.007502}
9697 354148 [iot_thread] [INFO ][MQTT][3541480] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffffee8) Wait complete with result TIMEOUT.
9698 354148 [iot_thread] IoTMQTT Qos1 Result 8
9699 354149 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0

starts with netinterface error, then wifi disconnets and mqtt starts to time out.


E (3525872) NetInterface: Failed to tx buffer 0x3ffffaaa, len 177, err -1

Then from timeout it becomes a network error


9731 356254 [iot_thread] [INFO ][MQTT][3562540] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffffee8) Wait complete with result TIMEOUT.
9732 356254 [iot_thread] IoTMQTT Qos1 Result 8
9733 356255 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9734 356444 [spi_ac_app] 14 0.007002 0.004872
9735 356444 [spi_ac_app] {"runtime":400,"raw_value":14,"voltage":0.007002}
9736 356444 [spi_ac_app] AC Queue error
9737 356456 [iot_thread] queue found {"runtime":400,"raw_value":14,"voltage":0.007002}
9738 356456 [iot_thread] Topic: /robco_local/ac_current/esp32_tests_gg/, Payload: {"runtime":400,"raw_value":14,"voltage":0.007002}.
9739 356456 [iot_thread] [INFO ][MQTT][3564560] (MQTT connection 0x3fff4044) MQTT PUBLISH operation queued.
9740 356456 [iot_thread] IoTMQTT Qos1 Result 1
9741 356456 [iot_thread] [INFO ][MQTT][3564560] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffffee8) Waiting for operation completion.
9742 356757 [iot_thread] [ERROR][NET][3567570] Error -27648 while sending data.
9743 356757 [iot_thread] [INFO ][MQTT][3567570] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffffee8) Wait complete with result NETWORK ERROR.
9744 356757 [iot_thread] IoTMQTT Qos1 Result 5
9745 356758 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9746 356844 [spi_ac_app] 14 0.007002 0.004872
9747 356844 [spi_ac_app] {"runtime":400,"raw_value":14,"voltage":0.007002}

And then connection closes?


9943 362445 [iot_thread] [INFO ][MQTT][3624450] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffeb624) Waiting for operation completion.
9944 362446 [iot_thread] [ERROR][NET][3624460] Error -27648 while sending data.
9945 362447 [iot_thread] [INFO ][MQTT][3624470] (MQTT connection 0x3fff4044, PUBLISH operation 0x3ffeb624) Wait complete with result NETWORK ERROR.
9946 362447 [iot_thread] IoTMQTT Qos1 Result 5
9947 362448 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9948 362650 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9949 362759 [iot_thread] [ERROR][NET][3627590] Error -27648 while sending data.
9950 362759 [iot_thread] [ERROR][MQTT][3627590] (MQTT connection 0x3fff4044) Failed to send PINGREQ.
9951 362759 [iot_thread] [INFO ][MQTT][3627590] (MQTT connection 0x3fff4044) Network connection closed.
9952 362844 [spi_ac_app] 11 0.005501 0.003828
9953 362844 [spi_ac_app] {"runtime":400,"raw_value":11,"voltage":0.005501}
9954 362851 [iot_thread] queue found {"runtime":400,"raw_value":11,"voltage":0.005501}
9955 362851 [iot_thread] Topic: /robco_local/ac_current/esp32_tests_gg/, Payload: {"runtime":400,"raw_value":11,"voltage":0.005501}.
9956 362851 [iot_thread] [WARN ][MQTT][3628510] (MQTT connection 0x3fff4044) Attempt to use closed connection.
9957 362851 [iot_thread] [ERROR][MQTT][3628510] (MQTT connection 0x3fff4044) New operation record cannot be created for a closed connection
9958 362851 [iot_thread] IoTMQTT Qos1 Result 5
9959 362852 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0
9960 363054 [iot_thread] State: WaitingForJob  Received: 1   Queued: 0   Processed: 0   Dropped: 0


I am just curious if this behavior is by design.

Just as in OTA demo app there is a networkstatechange callback which should set xNetworkConnected to pdFALSE, but if while loop checking OTA_GetAgentState does not break it doesn’t get a chance to re-init the networking.

Is it a good idea to implement WiFi reconnect in the networkstatechange callback and then in OTA app I should just check for NetworkConnected flag and re ini if it’s false? Or I got a mistake somewhere?
Thanks

PrasadV-AWS wrote on December 19, 2019:

Hello,

Thank you for sharing details about the issue and logs.

The OTA Agent does not handle the MQTT connections and the connection handle is passed from the OTA demo/app to the Agent during initialization. If the connection drops the OTA Agent will retry for number of times as set in the otaconfigMAX_NUM_REQUEST_MOMENTUM and if MQTT publish continues failing the OTA agent state changes to eOTA_AgentState_Stopped after shutdown. The OTA demo then re-establishes the connection and restarts the OTA process.

Please let me know if you have more questions on this topic.