FUOTA with AWS: State Inconsistency after a reboot?

Hi,

I am using the AWS IoT OTA library to perform updates. I want to make sure that the OTA status, as reported in the AWS job console, stays in sync with the actual status/firmware version running on the target device (ESP32). I found a scenario where this is not guaranteed:

Once the self-test procedure is done, the ESP32 marks the currently running firmware as valid.
The next step is to notify the cloud that the update was accepted. But if a reboot occurs before this notification (e.g. the ESP32 gets unplugged), the notification is never sent.

Upon the next boot, the cloud will again ask for a self-test. The ESP32 will see that it is not in self-test mode, try to mark the current valid firmware as invalid (which it cannot do in the default example version), and send a rejection notification to the cloud.

As a result, the cloud will mark the job as rejected, but the ESP32 will still run the updated firmware.

Can someone confirm that they see the same behavior? Am I missing a configuration flag somewhere? And what do you think is the best solution for this? I thought of two:

  1. In the AWS OTA library: before setting the new image state, write to internal memory that the ACCEPT notification still has to be sent, so that the self-test can be bypassed if the cloud asks for it on the next boot.
  2. Change the default application port (ota_pal.c), or any custom application code (not directly related to the OTA lib), so that it rolls back when a self-test is triggered by the cloud while the image is in an IMG_VALID state [assuming the passive firmware was not deleted yet!].

Thanks in advance.

PS: I added a small delay at the end of the otaPal_SetPlatformImageState function to have time to unplug my device.

Hey @lucas! Welcome to the FreeRTOS Community and thanks for posting to the forums!

This is an interesting issue you’re bringing up here. I’m following how this order of operations could potentially lead to the issues you’re talking about. We are actually currently in the process of creating a new version of the AWS OTA Library. As such I’ve reached out to the team that’s working on that new version to see if this issue is addressed there.

Thanks for reaching out and we appreciate your patience @lucas! This one was an interesting question that I haven’t been asked before so I wanted to do a little digging around before I posted anything.

To your questions:

Can someone confirm that they see the same behavior?

I haven’t tested this out personally, but I am certain that what you are describing is the behavior. Looking through the code, there is a pretty obvious fault which would occur given a power outage at exactly the right time, that time being what you’ve highlighted: after the device is updated but before the cloud status update is sent. It should be noted that power outages between setting the device state and sending the cloud message are extremely rare, given how short the time window is in which this would have to occur.

Am I missing a configuration flag somewhere?

There is no configuration entry to account for this behavior.

And what do you think is the best solution for this? I thought of two:

I really like your suggested solutions here. From my experience with the library, I think it will be far easier to modify functions that you control (the Agent callback, PAL callbacks, etc.) to achieve this functionality than to modify the agent itself. Basically, option 2 is the quicker of the two solutions.

Option 1 seems like the better long-term solution: rather than waiting for the cloud to ask for status, the device will immediately send out the accepted message on boot.

Now for a shameless plug:

The new OTA libraries that we are working on now, as mentioned above by @skptak, will make configuring this behavior far easier (they are pending release, so for now they are in this labs repo). The new libraries are smaller, with a narrower focus, and designed to be composable into the systems you use. The three libraries we have will help with interacting with AWS IoT Jobs (which is what the OTA update goes over), parsing the OTA update document, and streaming the update data to your device. The behavior is left to you, though we will publish examples ‘orchestrating’ these libraries to accomplish an OTA update.

This was done to make updating a FreeRTOS device easier than ever before by allowing you to use whatever OTA service (we still like to think AWS is great, and will continue to support it more extensively), downloading mechanism, and payload format that you like.

Thanks a lot for your answer. As I understand it, in the newer version of the OTA library the status update from the embedded device is left to the orchestrator, which we have to implement ourselves. I have a small question regarding this new OTA library: is it OS-agnostic? Or does it only work on FreeRTOS?

Regarding the options I proposed to solve the issue, in the meantime I came up with an even simpler solution:

Option 3: If the images we store in memory also contain their signature, which is the case if we use MCUboot for example, then we could tweak the OTA library to never reject an OTA job when the signature in the OTA job request received from the cloud matches the signature of the current image and the current image is in the running state (not in pending_selftest). The only drawback I see is that if we send the same OTA job to a device twice, both will be accepted. But in my opinion that is not a problem. It could even be used as a check to make sure a device has a given FW version.
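Roughly, the check for option 3 could look like the sketch below. Everything here is hypothetical naming for illustration (`RunningImage`, `decide_self_test_job`, the enums); how the signature of the running image is actually read back depends on the bootloader and image format, and that part is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical image states, mirroring "running" vs "pending_selftest". */
typedef enum { IMG_RUNNING_VALID, IMG_PENDING_SELF_TEST } ImageState;

typedef enum { DECISION_ACCEPT, DECISION_RUN_SELF_TEST, DECISION_REJECT } JobDecision;

/* Hypothetical description of the currently running image. */
typedef struct {
    const unsigned char *sig;  /* signature stored alongside the image */
    size_t sig_len;
    ImageState state;
} RunningImage;

/* Decide how to answer a self-test request coming from the cloud. */
JobDecision decide_self_test_job(const RunningImage *img,
                                 const unsigned char *job_sig,
                                 size_t job_sig_len)
{
    assert(img != NULL && img->sig != NULL && job_sig != NULL);

    if (img->state == IMG_PENDING_SELF_TEST) {
        return DECISION_RUN_SELF_TEST;  /* regular path, nothing special */
    }

    /* Image already confirmed: accept only if the job describes this image. */
    bool same = (img->sig_len == job_sig_len) &&
                (memcmp(img->sig, job_sig, job_sig_len) == 0);
    return same ? DECISION_ACCEPT : DECISION_REJECT;
}
```

A job whose signature matches the running, already confirmed image gets `DECISION_ACCEPT` every time it is received, which is also why sending the same job twice would be accepted twice.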

What do you think? Option 2 is a bit strange after all; a valid image should not be marked as invalid again…