FUOTA with AWS: State Inconsistency after a reboot?

lucas · September 4, 2023, 3:09pm

Hi,

I am using the AWS IoT OTA library to perform updates. I want to make sure that the OTA status reported in the AWS job console stays in sync with the actual status and firmware version running on the target device (ESP32). I found a scenario where this is not guaranteed:

After the self-test procedure is done, the ESP32 marks the currently running firmware as valid.
The next step is to notify the cloud that the update was accepted. But if a reboot occurs (ESP32 gets unplugged) before this notification, the notification is never sent.

Upon the next boot, the cloud will again ask for a self-test. The ESP32 will see that it is not in self-test mode, try to mark the currently valid firmware as invalid (which it cannot do in the default example version), and send a rejection notification to the cloud.

As a result, the cloud will mark the job as rejected, but the ESP32 will still run the updated firmware.
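The sequence described above can be reproduced with a small host-side model. This is only a sketch: every type and function name below is hypothetical and nothing here calls the real AWS IoT OTA library or the ESP32 PAL. It models the persisted image state and the job status seen in the console, with a flag that simulates unplugging the device between the two steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: these names are illustrative and are not taken from
 * the AWS IoT OTA library or the ESP32 PAL. */
typedef enum { IMG_PENDING_SELF_TEST, IMG_VALID } image_state_t;
typedef enum { JOB_IN_PROGRESS, JOB_SUCCEEDED, JOB_REJECTED } job_status_t;

typedef struct {
    image_state_t image_state; /* stored in flash, survives a reboot */
    job_status_t  job_status;  /* what the AWS job console would show */
} ota_model_t;

/* Self-test passed: mark the image valid, then notify the cloud.
 * power_loss_before_notify simulates unplugging between the two steps. */
void finish_self_test(ota_model_t *m, bool power_loss_before_notify)
{
    assert(m != NULL);
    m->image_state = IMG_VALID;
    if (power_loss_before_notify) {
        return; /* the notification never leaves the device */
    }
    m->job_status = JOB_SUCCEEDED;
}

/* Next boot: the job is still in progress, so the cloud asks for a
 * self-test again. */
void handle_self_test_request(ota_model_t *m)
{
    assert(m != NULL);
    if (m->job_status != JOB_IN_PROGRESS) {
        return; /* job already finished, nothing is requested */
    }
    if (m->image_state == IMG_PENDING_SELF_TEST) {
        finish_self_test(m, false); /* normal path */
    } else {
        /* Behaviour reported in this thread: the device is not in
         * self-test mode, so the job is rejected while the new image
         * keeps running. */
        m->job_status = JOB_REJECTED;
    }
}
```

Calling `finish_self_test(&m, true)` followed by `handle_self_test_request(&m)` on a model that starts in `IMG_PENDING_SELF_TEST` / `JOB_IN_PROGRESS` leaves the image valid and the job rejected, which is the mismatch in question.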

Can someone confirm that they see the same behavior? Am I missing a configuration flag somewhere? And what do you think is the best solution for this? I thought of two:

Option 1: In the AWS OTA library, before setting the new image state, write to internal memory that the ACCEPT notification still has to be sent, so that the self-test can be bypassed if the cloud asks for it on the next boot.
Option 2: Change the default application port (ota_pal.c) or any custom application (not directly related to the OTA lib) so that it rolls back when a self-test is triggered by the cloud while the image is in an IMG_VALID state [assuming the passive firmware has not been deleted yet!].
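A minimal sketch of the first idea (persisting an "ACCEPT still owed" marker before the image is marked valid) might look like the following. It assumes a hypothetical `persistent_state_t` standing in for non-volatile storage and a hypothetical `cloud_link_t` standing in for the MQTT connection; neither is part of the OTA library. An unreachable link is used to model the power loss, since both leave the same state behind in storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for non-volatile storage on the device. */
typedef struct {
    bool accept_pending; /* written BEFORE the image is marked valid */
    bool image_valid;
} persistent_state_t;

/* Hypothetical stand-in for the connection to the cloud. */
typedef struct {
    bool reachable;        /* false models a power loss or dropped link */
    int  accepts_received; /* ACCEPT notifications seen by the cloud */
} cloud_link_t;

static bool send_accept(cloud_link_t *link)
{
    if (!link->reachable) {
        return false;
    }
    link->accepts_received++;
    return true;
}

/* Called once the self-test has passed. */
void commit_image(persistent_state_t *nv, cloud_link_t *link)
{
    assert(nv != NULL && link != NULL);
    nv->accept_pending = true; /* 1. record that an ACCEPT is owed */
    nv->image_valid = true;    /* 2. mark the image valid */
    if (send_accept(link)) {   /* 3. notify, clear marker on success */
        nv->accept_pending = false;
    }
}

/* Called on every boot, before any self-test request is handled. */
void resume_on_boot(persistent_state_t *nv, cloud_link_t *link)
{
    assert(nv != NULL && link != NULL);
    if (nv->accept_pending && nv->image_valid && send_accept(link)) {
        nv->accept_pending = false;
    }
}
```

Writing the marker before the image state means an interruption at any point leaves either a still-pending image (the normal self-test runs again) or a valid image with the marker set (the ACCEPT is re-sent on boot).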

Thanks in advance.

PS: I added a small delay at the end of the otaPal_SetPlatformImageState function to have time to unplug my device.

skptak · September 5, 2023, 11:39pm

Hey @lucas! Welcome to the FreeRTOS Community and thanks for posting to the forums!

This is an interesting issue you’re bringing up here. I’m following how this order of operations could potentially lead to the issues you’re talking about. We are actually currently in the process of creating a new version of the AWS OTA Library. As such I’ve reached out to the team that’s working on that new version to see if this issue is addressed there.

kstribrn · September 21, 2023, 8:35pm

Thanks for reaching out and we appreciate your patience @lucas! This one was an interesting question that I haven’t been asked before so I wanted to do a little digging around before I posted anything.

To your questions:

Can someone confirm that they see the same behavior?

I haven’t tested this personally, but I am certain that what you are describing is the behavior. Looking through the code, there is a pretty obvious fault that would occur given a power outage at exactly the right time: after the device is updated but before the cloud status update is sent. It should be noted that a power outage between setting the device state and sending the cloud message is extremely rare, given how short that time window is.

Am I missing a configuration flag somewhere?

There is no configuration entry to account for this behavior.

And what do you think is the best solution for this? I thought of two:

I really like your suggested solutions here. From my experience with the library, I think it will be far easier to modify the functions that you control (the Agent callback, PAL callbacks, etc.) than to modify the agent itself. Basically, option 2 is the quicker of the two solutions.

Option 1 seems like the better long-term solution: rather than waiting for the cloud to ask for status, the device will send the accepted message immediately on boot.

kstribrn · September 21, 2023, 8:42pm

Now for a shameless plug:

The new OTA libraries (pending release so they are in this labs repo) that we are working on now as mentioned above by @skptak will make configuring this behavior far easier. The new libraries are smaller with a narrower focus and designed to be composable into systems you use. The three libraries we have will help with interacting with AWS IoT Jobs (which is what the OTA update goes over), parsing the OTA update document, and streaming the update data to your device. The behavior is left to you - though we will publish examples ‘orchestrating’ these libraries to accomplish an OTA update.

This was done to make updating a FreeRTOS device easier than ever before by allowing you to use whatever OTA service (we still like to think AWS is great, and will continue to support it more extensively), downloading mechanism, and payload format that you like.

lucas · September 22, 2023, 7:41am

Thanks a lot for your answer. As I understand it, in the newer version of the OTA library the status update from the embedded device is left to the orchestrator, and the orchestrator is something we implement ourselves. I have a small question regarding this new OTA library: is it OS-agnostic, or does it only work on FreeRTOS?

Regarding the options I proposed to solve the issue, in the meantime I came up with an even simpler solution:

Option 3: If the images we store in memory also contain their signature, which is the case if we use MCUboot for example, then we could tweak the OTA library to never reject an OTA job when the signature in the OTA job request received from the cloud matches the signature of the current image and the current image is in the running state (not in pending_selftest). The only drawback I see is that if we send the same OTA job to a device twice, both will be accepted. But in my opinion that is not a problem. It could even be used as a check to make sure a device has a given FW version.

What do you think? Option 2 is a bit strange after all; a valid image should not be marked as invalid again…
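The check proposed in Option 3 could be sketched roughly as below. The `signature_t` and `running_image_state_t` types and the function names are hypothetical and not part of MCUboot or the OTA library. The sketch also assumes that the signature stored with the image is the exact byte string delivered with the job, which matters because ECDSA signing is usually randomized and signing the same binary twice generally yields different bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration only. */
typedef enum { IMAGE_PENDING_SELF_TEST, IMAGE_RUNNING_VALID } running_image_state_t;

typedef struct {
    const unsigned char *bytes;
    size_t               len;
} signature_t;

/* Byte-wise identity check. This only asks "is it the same image?";
 * it is not a cryptographic verification of the firmware. */
static bool signatures_match(signature_t a, signature_t b)
{
    return a.len > 0 && a.len == b.len && memcmp(a.bytes, b.bytes, a.len) == 0;
}

/* Returns true when a self-test request from the cloud can be answered
 * with ACCEPT directly, because the job describes the image that is
 * already running and already marked valid. */
bool accept_without_self_test(running_image_state_t state,
                              signature_t job_sig,
                              signature_t image_sig)
{
    assert(job_sig.bytes != NULL && image_sig.bytes != NULL);
    return state == IMAGE_RUNNING_VALID && signatures_match(job_sig, image_sig);
}
```

An image that is still pending its self-test is deliberately excluded, so the regular self-test path is unaffected.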

kstribrn · October 17, 2023, 4:52pm

The new OTA library is designed to be completely OS and device agnostic. A really basic example we’ve been playing with for simple validation can be found here. This demo doesn’t do any validation, flashing, or bootloading, but it does show the general flow of things and the API usage we envisioned. Instructions can be found here.

kstribrn · October 17, 2023, 4:57pm

Option 3 seems very reasonable.

The only drawback I see is that if we send the same OTA job to a device twice, both will be accepted

I could see overcoming this by having an ‘ignore-and-flash’ flag which could be set in the OTA job on the cloud side. If the flag is clear, the agent could check the signatures and accept or fail. If it is set, the agent would ignore the signature and go through the entire OTA process even if it already has that firmware.

lucas · October 26, 2023, 1:12pm

Thanks for your answer about the “ignore-and-flash” flag.

To be precise: the initial problem I described happens after the update has been downloaded, verified (ECDSA), launched, and self-tested. The device then changes the image state from SELF-TEST to VALID and reboots before the notification about the new status is sent to the cloud (i.e. the cloud will re-attempt a self-test procedure until it gets a confirmation).

What I meant by “always accepting an update whose signature corresponds to the current FW” actually boils down to always sending an ACCEPT in response to a cloud message containing a self-test request.

Hence, I don’t think your flag would make any difference here. But it is true that if we try to re-send the same FW from the cloud, we may decide to force a re-flash (imo the device should always reject a FW whose version is lower than or equal to the current FW version).
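The version rule mentioned in the parenthesis (reject anything not strictly newer than the running firmware) can be sketched as follows. The `fw_version_t` layout with major, minor, and build fields is an assumption made for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version layout, assumed for this sketch. */
typedef struct {
    uint8_t  major;
    uint8_t  minor;
    uint16_t build;
} fw_version_t;

/* Returns -1, 0 or 1 when a is older than, equal to, or newer than b. */
static int compare_versions(const fw_version_t *a, const fw_version_t *b)
{
    if (a->major != b->major) {
        return a->major < b->major ? -1 : 1;
    }
    if (a->minor != b->minor) {
        return a->minor < b->minor ? -1 : 1;
    }
    if (a->build != b->build) {
        return a->build < b->build ? -1 : 1;
    }
    return 0;
}

/* Reject a candidate whose version is lower than or equal to the
 * version that is currently running. */
bool should_reject_candidate(const fw_version_t *running,
                             const fw_version_t *candidate)
{
    assert(running != NULL && candidate != NULL);
    return compare_versions(candidate, running) <= 0;
}
```

With a rule like this in place, re-sending the same firmware is rejected unless something on the cloud side explicitly asks for a forced re-flash.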

lucas · October 26, 2023, 1:13pm

Thanks. Is there a more precise release date planned for the new OTA library? I know it should come this year, but I can’t wait to try it.

kstribrn · November 22, 2023, 5:17pm

Sorry for the massive delay in responding @lucas. I got sucked in to a project for re:Invent 2023 which took up most of my time.

The new OTA component libraries along with demo ‘orchestrators’ (think agents but completely customizable) were actually just released yesterday! You can find the mqtt streaming component library here (version 1.0.0) and the updated AWS IoT jobs library here (version 1.4.0).

The webpages explaining the orchestrator demos and their final repository are a work in progress and should be deployed in the next few weeks.