Empty execution response from DescribeNextJob causing error in OTAAgent - How should it be solved?

After successfully performing an OTA download and boot, on every subsequent boot we are seeing the following in our logs.

0000000000:000054811:OS:Warning: [INFO ][MQTT][54813] (MQTT connection 0x24034398, PUBLISH operation 0x24034688) Wait complete with result SUCCESS.
0000000000:000054811:OS:Warning: [prvOTAAgentTask] Called handler. Current State [RequestingJob] Event [RequestJobDocument] New state [WaitingForJob]
0000000000:000054812:OS:Warning: Success: Pushed event message to queue.
0000000000:000054812:OS:Warning: [prvOTAAgentTask] , State matched 3,[WaitingForJob]
0000000000:000054812:OS:Warning: [prvOTAAgentTask] , Event matched  3,[ReceivedJobDocument]
0000000000:000054812:OS:Warning: [prvParseJobDoc] Size of OTA_FileContext_t [64]
0000000000:000054812:OS:Warning: [prvParseJSONbyModel] Extracted parameter [ clientToken: 0:300534060389100 ]
0000000000:000054812:OS:Warning: [prvParseJSONbyModel] parameter not present: execution
0000000000:000054812:OS:Warning: [prvParseJSONbyModel] parameter not present: jobId
0000000000:000054812:OS:Warning: [prvParseJSONbyModel] parameter not present: jobDocument
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: afr_ota
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: protocols
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: files
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: filepath
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: filesize
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: fileid
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: certfile
0000000000:000054813:OS:Warning: [prvParseJSONbyModel] parameter not present: sig-sha256-ecdsa
0000000000:000054813:OS:Warning: [prvDefaultCustomJobCallback] Received Custom Job inside OTA Agent which is not supported.
0000000000:000054814:OS:Warning: [prvParseJobDoc] Ignoring job without ID.
0000000000:000054814:OS:Warning: [prvOTA_Close] Context->0x0x2404d668
0000000000:000054814:OS:Warning: [prvPAL_Abort] Abort - OK
0000000000:000054814:OS:Warning: [prvPAL_SetPlatformImageState] Aborted image.
0000000000:000067751:OS:Warning: [prvOTAAgentTask] Handler failed. Current State [WaitingForJob] Event  [ReceivedJobDocument] Error Code [603979776] 

I have traced this down to the fact that DescribeJobExecution for the $next job when there is no job returns a JSON document with no execution node. This is passed into prvParseJSONbyModel and the default custom job handler (which cannot be override easily) which fails and returns to prvProcessJobHandler. As the xOTAFileContext is null, and the image state is not in testing, it aborts the job as though there was a parse error, and on this PAL forces an erasure of the 2nd bank of flash.

Using the MQTT Client in the IoT portal, I have executed the DescribeJob for the $next job as per https://docs.aws.amazon.com/iot/latest/developerguide/jobs-api.html#mqtt-describejobexecution
with

{ "clientToken":"Fred"}

as the payload, and the result is

{
  "clientToken": "Fred",
  "timestamp": 1582855574
}

and the absence of the "execution" node can clearly be seen.

What is the best way to handle this. I see one option being to add some way to set xOTA_Agent.xPALCallbacks.xCustomJobCallback with a callback that does not return an error (and would have the nice side effect of providing a mechanism for custom jobs to be handled by our firmware.

Or is this a configuration/misunderstanding of how it is supposed to work?

Thanks in advance.
Matthew

Hi Mathew,

Thank you for reporting this.

The OTA_AgentInit_internal function can be used to initialize and register all the custom callbacks instead of OTA_AgentInit.

Can you please share details about the platform you are using and the FreeRTOS release ? We will try to reproduce this and provide more explanation or solution if required.

The platform is based on a STM32 port, and the release is based from the 202002 tag, Unfortunately due to NDA I cannot elaborate much more on the STM32 side, however, whilst investigating this issue, I did identify that other people in this forum had seen similar symptoms.

I will try the OTA_AgentInit_internal function, and see if by returning “no error” stops this issue.

Thanks for the response.

After trying the OTA_AgentInit_internal function, it still didn’t give the desired response due to there being no valid JobContext.

Cheers

Matthew

Hello Matthew !

We have created an internal ticket to address the problem you are facing. We will track the progress in the ticket and I’ll post updates here if you prefer. Meanwhile if you have any questions regarding this issue, feel free to reach out on this thread.

Thanks.

1 Like

Hi Abhi,
Any updates as you make them would be very much welcome (please note that there is also an AWS support case referencing this issue).

Cheers

Matthew

1 Like

Hi Mathew,

Thank you for sharing details about the platform and build you are using. We can not reproduce this issue as you have a custom OTA PAL port on your platform.

However from the logs it seems that the issue is with the OTA PAL port in prvPAL_Abort implementation. When the device reboots and next job document is requested with no pending jobs in the OTA service an empty job document will be received. This job document parsing will result in error as required fields will not be present and prvOTA_Close is called. This in turn calls the OTA pal function prvPAL_Abort which has input parameter OTA_FileContext_t * const C.

In the implementation of this function we must check if C is not NULL and then if C->pxFile is not NULL before erasing the bank.

Please check if this is handled in your OTA PAL port. You can refer the FreeRTOS visual studio simulator OTA PAL port code at https://github.com/aws/amazon-freertos/blob/a6a459b5e498f92db5ab16d6bf69f8d449d8c8e9/vendors/pc/boards/windows/ports/ota/aws_ota_pal.c#L115

Please let me know if this works for you.

If you want to use custom job please refer the AWS blog with details https://aws.amazon.com/blogs/iot/how-to-perform-secondary-processor-over-the-air-updates-with-freertos/

Thanks for that information, unfortunately this didn’t solve the issue, as it is not prvPAL_Abort that is causing the bank to be erased, it’s stemming from prvPAL_SetPlatformImageState being called with eOTA_ImageState_Aborted which I believe is being triggered by the following logic

xOTAFileContext = prvGetFileContextFromJob( ( const char * ) pxEventData->ucData,
                                            pxEventData->ulDataLength );

/*
 * A null context here could either mean we didn't receive a valid job or it could
 * signify that we're in the self test phase (where the OTA file transfer is already
 * completed and we've reset the device and are now running the new firmware). We
 * will check the state to determine which case we're in.
 */
if( xOTAFileContext == NULL )
{
    /*
     * If the OTA job is in the self_test state, alert the application layer.
     */
    if( OTA_GetImageState() == eOTA_ImageState_Testing )
    {
        /* Send event to OTA task to start self-test. */
        xEventMsg.xEventId = eOTA_AgentEvent_StartSelfTest;
        OTA_SignalEvent( &xEventMsg );

        xReturn = kOTA_Err_None;
    }
    else
    {
        /*
         * If the job context returned NULL and the image state is not in the self_test state,
         * then an error occurred parsing the OTA job message.  Abort the OTA job with a parse error.
         *
         * If there is a valid job id, then a job status update will be sent.
         */
        ( void ) prvSetImageStateWithReason( eOTA_ImageState_Aborted, kOTA_Err_JobParserError );

        xReturn = kOTA_Err_JobParserError;
    }
}

in aws_iot_ota_agent.c (around line 828) (as prvGetFileContextFromJob will return null), and the image is not in testing.

Also, thanks for the information regarding custom jobs.

Cheers

Matthew

Hello,

Thank you for sharing more details. When we have the condition of receiving empty job document it is considered invalid in OTA Agent as required fields are missing and calls prvPAL_SetPlatformImageState with eOTA_ImageState_Aborted. I will investigate if we can identify the job document as empty and ignore it. Regarding the OTA pal port, I am not familiar with the port you are using but if the entire bank is being erased when the platform is not in self-test and prvPAL_SetPlatformImageState is called with abort then it can be removed in OTA pal port. The bank must be erased when we receive a valid job document in OTA pal function prvPAL_CreateFileForRx.