IoT OTA update retry behaviour

amstnz wrote on January 30, 2020:

What's the behaviour of OTA when an update fails to be delivered part-way through due to a comms link failure? If the thing reconnects later, will the download resume from the same point? Or will it restart? What has to happen before the update is marked as failed?

Thanks

PrasadV-AWS wrote on February 01, 2020:

Hello,

When the MQTT connection is disconnected, the OTA Agent shuts down and performs cleanup. Once cleanup is complete, the OTA demo task tries to establish a new MQTT connection and restarts the OTA process.

The OTA can fail for a number of reasons, such as no file blocks being received within the maximum number of retries, or a failure writing the file blocks on the platform. The job is updated with the error codes, and the following link has more information about how to retrieve them from the job service:

https://docs.aws.amazon.com/freertos/latest/userguide/ota-failure-codes.html

Please let me know if you have any more questions.

amstnz wrote on February 02, 2020:

Thanks Prasad.

So if I understand you correctly, the second OTA attempt will continue from the point where the first one failed? Blocks that have been transferred successfully don't need to be transferred again?

Thanks

PrasadV-AWS wrote on February 03, 2020:

Hi,

The OTA will restart: it will request the job document and create a new file for the download. All the blocks will be requested again.

Do you expect to experience many connection failures on your platform during OTA? Can you please share more details: which platform and connectivity option you are using, as well as the Amazon FreeRTOS version?

amstnz wrote on February 03, 2020:

The system in question isn't built yet; I'm evaluating a possible design.

The transport layer for OTA updates may be somewhat expensive, so I need to understand how the retry mechanism will work. If the client will automatically restart downloads from scratch, what controls how many times it will retry before giving up? I'm concerned that for a device that has poor connectivity we could chew through a lot of data this way.

Thanks

PrasadV-AWS wrote on February 04, 2020:

The OTA Agent will retry MQTT operations the number of times set by the OTA configuration option otaconfigMAX_NUM_REQUEST_MOMENTUM before shutting down (which performs cleanup and deletes the OTA Agent task). The OTA demo/application will then try to establish a new MQTT connection and restart the OTA from the beginning. This is demonstrated as a continuous loop in the OTA demo, which can easily be modified to retry a configured number of times.
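
As a rough illustration of that modification (a sketch only, not the actual demo code), the continuous loop could be replaced with one that stops after a fixed number of full restarts or once a data budget has been used up. Every name below (ota_retry_policy_t, ota_run_with_policy, ota_attempt_fn, sim_link_t, sim_attempt) is hypothetical and not part of the Amazon FreeRTOS OTA library. In a real application the attempt callback would wrap "connect MQTT, start the OTA Agent, wait for it to finish or shut down", and would have to count the bytes transferred itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: performs ONE full OTA attempt (connect, download
 * from block 0, disconnect). Returns true on success and reports how many
 * bytes the attempt transferred via bytes_used. */
typedef bool (*ota_attempt_fn)(void *ctx, uint32_t *bytes_used);

/* Hypothetical policy: limits chosen by the application, not by the OTA Agent. */
typedef struct {
    uint32_t max_attempts;    /* give up after this many full restarts */
    uint32_t max_total_bytes; /* stop starting new attempts past this budget */
} ota_retry_policy_t;

typedef enum {
    OTA_RETRY_OK,
    OTA_RETRY_ATTEMPTS_EXHAUSTED,
    OTA_RETRY_BUDGET_EXHAUSTED
} ota_retry_result_t;

/* Runs attempts until one succeeds or a limit is hit. The budget is checked
 * before each attempt, so the final attempt may overshoot it. */
ota_retry_result_t ota_run_with_policy(const ota_retry_policy_t *policy,
                                       ota_attempt_fn attempt, void *ctx,
                                       uint32_t *total_bytes_out)
{
    assert(policy != NULL && attempt != NULL && total_bytes_out != NULL);

    uint32_t total = 0;
    for (uint32_t i = 0; i < policy->max_attempts; i++) {
        if (total >= policy->max_total_bytes) {
            *total_bytes_out = total;
            return OTA_RETRY_BUDGET_EXHAUSTED;
        }
        uint32_t used = 0;
        bool ok = attempt(ctx, &used);
        total += used; /* every restart re-downloads from the first block */
        if (ok) {
            *total_bytes_out = total;
            return OTA_RETRY_OK;
        }
    }
    *total_bytes_out = total;
    return OTA_RETRY_ATTEMPTS_EXHAUSTED;
}

/* Stand-in for a flaky link, for demonstration only: the first
 * failures_left attempts drop out after bytes_per_failed_attempt bytes,
 * then one attempt downloads the whole image. */
typedef struct {
    uint32_t failures_left;
    uint32_t bytes_per_failed_attempt;
    uint32_t image_bytes;
} sim_link_t;

bool sim_attempt(void *ctx, uint32_t *bytes_used)
{
    sim_link_t *link = (sim_link_t *)ctx;
    if (link->failures_left > 0) {
        link->failures_left--;
        *bytes_used = link->bytes_per_failed_attempt;
        return false;
    }
    *bytes_used = link->image_bytes;
    return true;
}
```

With a policy like this, the worst-case data use is bounded by roughly max_total_bytes plus one further attempt, rather than growing without limit on a device with a poor link.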

I will let our product manager answer your question about resuming an OTA after an MQTT reconnect for devices with poor connectivity.

DanN-AWS wrote on February 04, 2020:

OTA pause/resume is a feature we have prioritized for 2020. We will provide more information on this feature in the near future.