I am using the AWS IoT OTA library to perform updates. I want to make sure that the OTA status, as reported in the AWS job console, stays in sync with the actual status/firmware version running on the target device (ESP32). I found a scenario where this is not guaranteed:
After the self-test procedure is done, the ESP32 marks the current running firmware as valid.
The next step is to notify the cloud that the update was accepted. But if a reboot occurs (ESP32 gets unplugged) before this notification, the notification is never sent.
Upon the next boot, the cloud will again ask for a self-test. The ESP32 will see that it is not in self-test mode, try to mark the current valid firmware as invalid (which it cannot do in the default example version), and send a rejection notification to the cloud.
As a result, the cloud will mark the job as rejected, but the ESP32 will still run the updated firmware.
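The sequence above can be reproduced in a small host-side model. This is only a sketch: the types and functions (`ImgState`, `JobStatus`, `finish_self_test`, `handle_job_on_boot`) are invented for illustration and are not the OTA library's API, and the self-test itself is reduced to a state change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two pieces of state that can diverge. */
typedef enum { IMG_PENDING_SELFTEST, IMG_VALID } ImgState;
typedef enum { JOB_IN_PROGRESS, JOB_SUCCEEDED, JOB_REJECTED } JobStatus;

typedef struct { ImgState img; } Device;    /* persisted on the device */
typedef struct { JobStatus status; } Cloud; /* shown in the job console */

/* Self-test passed: mark the image valid, then notify the cloud.
 * If power is lost in between, only the first half happens. */
void finish_self_test(Device *dev, Cloud *cloud, bool power_lost_before_notify)
{
    assert(dev != NULL && cloud != NULL);
    dev->img = IMG_VALID;
    if (!power_lost_before_notify) {
        cloud->status = JOB_SUCCEEDED;
    }
}

/* Next boot: a job that is still in progress makes the cloud ask for a
 * self-test again. */
void handle_job_on_boot(const Device *dev, Cloud *cloud)
{
    assert(dev != NULL && cloud != NULL);
    if (cloud->status != JOB_IN_PROGRESS) {
        return; /* job already finished, nothing to do */
    }
    if (dev->img == IMG_PENDING_SELFTEST) {
        return; /* normal path: the self-test would run here */
    }
    /* The image is already valid, so the device is not in self-test mode
     * and answers with a rejection while still running the new firmware. */
    cloud->status = JOB_REJECTED;
}
```

Calling `finish_self_test` with `power_lost_before_notify` set and then `handle_job_on_boot` leaves the model with a valid image on the device and a rejected job in the cloud, which is the mismatch described above.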
Can someone confirm that they see the same behavior? Am I missing a configuration flag somewhere? And what do you think is the best solution for this? I thought of two:
Option 1: In the AWS OTA library, before setting the new image state, write to internal memory that the ACCEPT notification still has to be sent, so that the self-test can be bypassed if the cloud asks for it again.
Option 2: Change the default application port (ota_pal.c) or any custom application code (not directly related to the OTA lib) so that it rolls back when a self-test is triggered by the cloud while the image is in an IMG_VALID state (assuming the passive firmware has not been deleted yet!).
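The first idea (persisting the intent to send ACCEPT before changing the image state) could look roughly like the sketch below. Everything here is an assumption for illustration: `PersistentRecord` stands in for whatever non-volatile storage is available on the device, and none of these functions exist in the OTA library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a record kept in non-volatile storage (hypothetical layout).
 * Each assignment below represents one committed write. */
typedef struct {
    bool image_valid;    /* image state as seen by the bootloader */
    bool accept_pending; /* ACCEPT still has to reach the cloud */
} PersistentRecord;

/* Record the intent first, then change the image state. A reset right
 * after this call leaves accept_pending set for the next boot. */
void mark_image_accepted(PersistentRecord *rec)
{
    assert(rec != NULL);
    rec->accept_pending = true;
    rec->image_valid = true;
}

/* True when an ACCEPT notification is still owed to the cloud. Checked on
 * boot, or when the cloud asks for a self-test on an already valid image. */
bool accept_notification_due(const PersistentRecord *rec)
{
    assert(rec != NULL);
    return rec->accept_pending;
}

/* Call only once the ACCEPT notification has actually been delivered. */
void accept_notification_sent(PersistentRecord *rec)
{
    assert(rec != NULL);
    rec->accept_pending = false;
}
```

With this ordering, a power loss between marking the image valid and notifying the cloud leaves the flag set, so the device can send ACCEPT instead of a rejection on the next boot.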
Thanks in advance.
PS: I added a small delay at the end of the otaPal_SetPlatformImageState function to have time to unplug my device.
Hey @lucas! Welcome to the FreeRTOS Community and thanks for posting to the forums!
This is an interesting issue you're bringing up here. I'm following how this order of operations could potentially lead to the issues you're talking about. We are actually currently in the process of creating a new version of the AWS OTA Library. As such I've reached out to the team that's working on that new version to see if this issue is addressed there.
Thanks for reaching out and we appreciate your patience @lucas! This one was an interesting question that I haven't been asked before so I wanted to do a little digging around before I posted anything.
To your questions:
Can someone confirm that they see the same behavior?
I haven't tested this out personally, but I am certain that what you are describing is the behavior. Looking through the code, there is a pretty obvious fault that would occur given a power outage at exactly the right time. That exact time is what you've highlighted: after the device is updated but before the cloud status update is sent. It should be noted that power outages between setting the device state and sending the cloud message are extremely rare, given how short the time window is in which this would have to occur.
And what do you think is the best solution for this? I thought of two:
I really like your suggested solutions here. From my experience with the library, I think it will be far easier to modify functions that you control (the Agent callback, PAL callbacks, etc.) to achieve this functionality than to modify the agent itself. Basically, option 2 is the quicker of the two solutions.
Option 1 seems like the better long-term solution: rather than waiting for the cloud to ask for status, the device will send out the accepted message immediately on boot.
The new OTA libraries (pending release, so they are in this labs repo) that we are working on now, as mentioned above by @skptak, will make configuring this behavior far easier. The new libraries are smaller, with a narrower focus, and designed to be composable into the systems you use. The three libraries will help with interacting with AWS IoT Jobs (which is what the OTA update goes over), parsing the OTA update document, and streaming the update data to your device. The behavior is left to you, though we will publish examples "orchestrating" these libraries to accomplish an OTA update.
This was done to make updating a FreeRTOS device easier than ever before by allowing you to use whatever OTA service (we still like to think AWS is great, and will continue to support it more extensively), downloading mechanism, and payload format that you like.
Thanks a lot for your answer. As I understand it, in the newer version of the OTA library, the status update from the embedded device is left to the orchestrator, and the orchestrator is to be implemented by ourselves. I have a small question regarding this new OTA library: is it OS-agnostic? Or does it only work on FreeRTOS?
Regarding the options I proposed to solve the issue, in the meantime I came up with an even simpler solution:
Option 3: If the images we store in memory also contain their signature, which is the case if we use MCUboot for example, then we could tweak the OTA library to never reject an OTA job when the signature in the OTA job request received from the cloud matches the signature of the current image and the current image is in the running state (not in pending_selftest). The only drawback I see is that if we send the same OTA job to a device twice, both will be accepted. But in my opinion it's not a problem. It could even be used as a check to make sure a device has a given FW version.
What do you think? Option 2 is a bit strange after all; a valid image should not be marked as invalid again...
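A minimal sketch of the check described in Option 3, assuming both signatures are available as raw byte buffers. The enum and the function `on_self_test_request` are hypothetical and not part of the OTA library or MCUboot; how the stored signature is read from the image is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Possible reactions to a self-test request from the cloud (hypothetical). */
typedef enum { SELFTEST_RUN, SELFTEST_ACCEPT, SELFTEST_REJECT } SelfTestAction;

/* job_sig: signature carried in the OTA job document.
 * img_sig: signature stored alongside the currently running image. */
SelfTestAction on_self_test_request(const uint8_t *job_sig, size_t job_sig_len,
                                    const uint8_t *img_sig, size_t img_sig_len,
                                    bool pending_selftest)
{
    assert(job_sig != NULL && img_sig != NULL);

    if (pending_selftest) {
        return SELFTEST_RUN; /* normal path: the image still needs its self-test */
    }
    /* Image already valid: accept only if the job refers to this exact image. */
    if (job_sig_len > 0 && job_sig_len == img_sig_len &&
        memcmp(job_sig, img_sig, job_sig_len) == 0) {
        return SELFTEST_ACCEPT;
    }
    return SELFTEST_REJECT;
}
```

A job whose signature matches the running, already validated image is answered with ACCEPT; a mismatch keeps the existing rejection behavior.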
The new OTA library is designed to be completely OS and device agnostic. A really basic example we've been playing with for simple validation can be found here. This demo doesn't do any validation, flashing, or bootloading, but it does show the general flow of things and the API usage we envisioned. Instructions can be found here.
The only drawback I see is that if we send the same OTA job to a device twice, both will be accepted
I could see overcoming this by having an "ignore-and-flash" flag which could be set in the OTA job on the cloud side. If this is clear, then the agent could check the signatures and accept or fail. If it is set, it would ignore the signature and go through the entire OTA process even if it already has that firmware.
Thanks for your answer about the "ignore-and-flash" flag.
To be precise: the initial problem I described happens after the update has been downloaded, verified (ECDSA), launched and self-tested. The device then changes the image state from SELF-TEST to VALID, and reboots before the notification about the new status is sent to the cloud (i.e. the cloud will re-attempt a self-test procedure until it gets a confirmation).
What I meant by "always accepting an update whose signature corresponds to the current FW" actually boils down to always sending an ACCEPT in response to a cloud message containing a self-test request.
Hence, I don't think your flag would make any difference here... but it is true that if we try to re-send the same FW from the cloud, we may decide to force a re-flash... (imo the device should always reject a FW whose version is lower than or equal to that of the current FW).
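The rule suggested in parentheses (reject any firmware whose version is lower than or equal to the current one) could be sketched as below. `FwVersion` and `is_strictly_newer` are made-up names for illustration, and the three-field major/minor/build layout is an assumption about how versions are encoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version triple; adapt it to the project's real version type. */
typedef struct {
    uint32_t major;
    uint32_t minor;
    uint32_t build;
} FwVersion;

/* Returns true only if candidate is strictly newer than current, so an
 * equal or lower version is rejected. */
bool is_strictly_newer(const FwVersion *candidate, const FwVersion *current)
{
    assert(candidate != NULL && current != NULL);
    if (candidate->major != current->major) {
        return candidate->major > current->major;
    }
    if (candidate->minor != current->minor) {
        return candidate->minor > current->minor;
    }
    return candidate->build > current->build;
}
```

A forced re-flash of the same version would then need an explicit override on top of this check.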
Sorry for the massive delay in responding @lucas. I got sucked into a project for re:Invent 2023 which took up most of my time.
The new OTA component libraries, along with demo "orchestrators" (think agents, but completely customizable), were actually just released yesterday! You can find the MQTT streaming component library here (version 1.0.0) and the updated AWS IoT Jobs library here (version 1.4.0).
The webpages explaining the orchestrator demos and their final repository are a work in progress and should be deployed in the next few weeks.