OTA library and excessive duplicate blocks received

nickm2018 wrote on July 06, 2019:

Lately, i am seeing a lot more duplicate blocks received when downloading an image.
This is extending a update by minutes, sometimes longer.

For example, one device had 404 “is a duplicate” messages. This caused the update to take about 4 additional minutes versus the other 30 devices in the same pool.

Is there something i should look at to debug this issue? Could something have changed on the server side’s logic? i am running an older ota library but according to the release notes, nothing major has changed there (versus the major change on the mqtt interface). I am not interested in updating the mqtt layer to the latest library version as what we had has been stable, has been verified, and our platform is at a mature stage.


                    if ( ( C->pacRxBlockBitmap[ulByte] & ulBitMask ) == 0U )    /* If we've already received this block... */
                        {
                            OTA_LOG_L1("[%s] block %u is a DUPLICATE. %u blocks remaining.\r
", OTA_METHOD_NAME,
                                       ulBlockIndex,
                                       C->ulBlocksRemaining );
                            eIngestResult = eIngest_Result_Duplicate_Continue;
                            *pxCloseResult = kOTA_Err_None;                         /* This is a success path. */
                        }

Library versions
Amazon FreeRTOS MQTT Agent V1.1.3
Amazon FreeRTOS MQTT Library V1.1.3
Amazon FreeRTOS OTA Agent V1.0.0

prasadvya wrote on July 08, 2019:

Hi nickm2018,

The OTA agent requests data blocks from the OTA streaming service using the request timer. The request timer is reset whenever a data block is received from the OTA service so the request message is only sent after being idle for this amount of time.

The timer configuration is located in the aws_ota_agent_config.h file and defined as -
#define otaconfigFILE_REQUEST_WAIT_MS 16000U

If you are receiving duplicate packets it means the data requests are being sent too fast and we should increase this timer based on the worst case expected network condition and platform being used.

Thank you for reporting this and let me know if higher timeout reduces duplicate packets.

Edited by: prasadvya on Jul 8, 2019 10:17 AM

nickm2018 wrote on July 12, 2019:

Thanks, i will look at adjusting this value. In my configuration it is currently set to 2500.

I have not touched this in a long time, definitely before we considered the ota system stable.

I have a fairly large OTA image, > 1.17MB so i do not want to make this delay too long.

Edited by: nickm2018 on Jul 12, 2019 2:09 PM

nickm2018 wrote on July 12, 2019:

nickm2018 wrote:
Thanks, i will look at adjusting this value. In my configuration it is currently set to 2500.

I have not touched this in a long time, definitely before we considered the ota system stable.

I have a fairly large OTA image, > 1.17MB so i do not want to make this delay too long.

Looks like 2500 is the value that is used in the current version of the library. Where did 16000 come from? Is this from internal testing? Does the fix me comment indicate that the user should experimentally set this value based on use cases?


/**
 * @brief Milliseconds to wait before requesting data blocks from the OTA service if nothing is happening.
 *
 * The wait timer is reset whenever a data block is received from the OTA service so we will only send
 * the request message after being idle for this amount of time.
 */
#define otaconfigFILE_REQUEST_WAIT_MS           2500U   /* FIX ME. */

PrasadV-AWS wrote on July 16, 2019:

Hi nickm2018,

The otaconfigFILE_REQUEST_WAIT_MS is the OTA configuration value that is set as default to 2500 msecs for running the OTA demo. This is set to higher value in some of the test projects.

The otaconfigFILE_REQUEST_WAIT_MS is the timeout for request timer which is reset every time a block is received. It means if we do not receive even a single block till otaconfigFILE_REQUEST_WAIT_MS the agent request the data blocks again from the OTA streaming service. And you are right this should be adjusted based on data block size, network, latency etc as per use case.

Let me know if this helps.

Edited by: PrasadV-AWS on Jul 29, 2019 11:40 AM

nickm2018 wrote on July 30, 2019:

What would cause the server to send the same set of data blocks again?

What ever happens, it happens often on some devices where it enters a loop of sending the same set of data blocks over and over until i time out at the application layer.

Edited by: nickm2018 on Jul 30, 2019 10:51 AM

PrasadV-AWS wrote on July 30, 2019:

Hi nickm2018,

The OTA Agent uses Q0S 1 for the control messages and QoS 0 for the data messages. The agent has a timer which is used for waiting for data blocks from service and requesting them again if not received in that time frame as it is QoS 0. This is the same timer which I mentioned uses the configuration parameter otaconfigFILE_REQUEST_WAIT_MS for which the default value is 2500 ms to run the OTA demo. If we have slower network or network congestion then this timer will be too fast and same data blocks will get requested form the OTA service which might be already in transit/queued. Increasing the otaconfigFILE_REQUEST_WAIT_MS to a higher value depending on the network will help. What is the application level timeout you are using ?

nickm2018 wrote on July 31, 2019:

I extended the OTA_FileRequest timeout to 16 seconds and i still see duplicates

Here is an example from an OTA of 38 devices.

Block 19,19, and 29 are received and written to flash. In the same reported second, I see block 12 being received a second time. With QoS of 0, i would not expect to see any MQTT retries.

“7/31/2019 6:08:44 PM”,738617 [OTA Task] [prvPublishGetStreamMessage] OK: $aws/things/CC_C0_79_DF_55_A4/streams/AFR_OTA-d3d23e81-33a6-40aa-96f0-6f506d07481c/get/cbor
“7/31/2019 6:08:44 PM”,738617 [OTA Task] [prvPublishGetStreamMessage] Momentum = 1
“7/31/2019 6:08:44 PM”,738617 [OTA Task] [prvStartRequestTimer] Starting OTA_FileRequest timer.
“7/31/2019 6:08:49 PM”,743974 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:49 PM”,743974 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:49 PM”,744085 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:49 PM”,744087 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:49 PM”,744123 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:49 PM”,744143 [MQTT] Received fixed header, 4209 bytes to receive.
“7/31/2019 6:08:49 PM”,744204 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:50 PM”,744518 [OTA Task] [prvStartRequestTimer] Starting OTA_FileRequest timer.
“7/31/2019 6:08:50 PM”,744519 [OTA Task] [prvIngestDataBlock] Received file block 29, size 4096
“7/31/2019 6:08:50 PM”,744550 [OTA Task] [OTA] Flash Write 4096 @ 118784. BW: 122880 Header: 0
“7/31/2019 6:08:50 PM”,744550 [OTA Task] [prvIngestDataBlock] Remaining: 265
“7/31/2019 6:08:50 PM”,744550 [OTA Task] [prvOTAUpdateTask] Returned buffer to MQTT client.
“7/31/2019 6:08:50 PM”,744716 [MQTT] Received fixed header, 4208 bytes to receive.
“7/31/2019 6:08:50 PM”,744819 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:50 PM”,744819 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:50 PM”,745133 [OTA Task] [prvStartRequestTimer] Starting OTA_FileRequest timer.
“7/31/2019 6:08:50 PM”,745134 [OTA Task] [prvIngestDataBlock] Received file block 19, size 4096
“7/31/2019 6:08:50 PM”,745162 [OTA Task] [OTA] Flash Write 4096 @ 77824. BW: 126976 Header: 0
“7/31/2019 6:08:50 PM”,745162 [OTA Task] [prvIngestDataBlock] Remaining: 264
“7/31/2019 6:08:50 PM”,745162 [OTA Task] [prvOTAUpdateTask] Returned buffer to MQTT client.
“7/31/2019 6:08:50 PM”,745221 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:50 PM”,745221 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:50 PM”,745284 [MQTT] Received fixed header, 4208 bytes to receive.
“7/31/2019 6:08:50 PM”,745290 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:50 PM”,745341 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:50 PM”,745361 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:50 PM”,745400 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:51 PM”,745560 [OTA Task] [prvStartRequestTimer] Starting OTA_FileRequest timer.
“7/31/2019 6:08:51 PM”,745561 [OTA Task] [prvIngestDataBlock] Received file block 12, size 4096
“7/31/2019 6:08:51 PM”,745589 [OTA Task] [OTA] Flash Write 4096 @ 49152. BW: 131072 Header: 0
“7/31/2019 6:08:51 PM”,745589 [OTA Task] [prvIngestDataBlock] Remaining: 263
“7/31/2019 6:08:51 PM”,745589 [OTA Task] [prvOTAUpdateTask] Returned buffer to MQTT client.
“7/31/2019 6:08:51 PM”,745623 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:51 PM”,745623 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:51 PM”,745673 [MQTT] Received fixed header, 4208 bytes to receive.
“7/31/2019 6:08:51 PM”,745889 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:51 PM”,745891 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:51 PM”,746051 [OTA Task] [prvStartRequestTimer] Starting OTA_FileRequest timer.
“7/31/2019 6:08:51 PM”,746052 [OTA Task] [prvIngestDataBlock] Received file block 12, size 4096
“7/31/2019 6:08:51 PM”,746052 [OTA Task] [prvIngestDataBlock] block 12 is a DUPLICATE. 263 blocks remaining.
“7/31/2019 6:08:51 PM”,746052 [OTA Task] [prvOTAUpdateTask] Returned buffer to MQTT client.
“7/31/2019 6:08:51 PM”,746150 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:51 PM”,746150 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:51 PM”,746185 [worker thread] Socket sending wakeup to MQTT task.
“7/31/2019 6:08:51 PM”,746245 [MQTT] Received fixed header, 4208 bytes to receive.
“7/31/2019 6:08:51 PM”,746297 [MQTT] Received message 0 from queue.
“7/31/2019 6:08:52 PM”,746620 [OTA Task] [prvStartRequestTimer] Starting OTA_FileRequest timer.
“7/31/2019 6:08:52 PM”,746624 [OTA Task] [prvIngestDataBlock] Received file block 19, size 4096
“7/31/2019 6:08:52 PM”,746624 [OTA Task] [prvIngestDataBlock] block 19 is a DUPLICATE. 263 blocks remaining.
“7/31/2019 6:08:52 PM”,746624 [OTA Task] [prvOTAUpdateTask] Returned buffer to MQTT client.
“7/31/2019 6:08:52 PM”,746812 [MQTT] Received fixed header, 4209 bytes to receive.
“7/31/2019 6:08:52 PM”,747285 [OTA Task] [prvStartRequestTimer] Starting OTA_FileRequest timer.
“7/31/2019 6:08:52 PM”,747286 [OTA Task] [prvIngestDataBlock] Received file block 29, size 4096
“7/31/2019 6:08:52 PM”,747286 [OTA Task] [prvIngestDataBlock] block 29 is a DUPLICATE. 263 blocks remaining.
“7/31/2019 6:08:52 PM”,747286 [OTA Task] [prvOTAUpdateTask] Returned buffer to MQTT client.

nickm2018 wrote on July 31, 2019:

At the application level i have added a 6 minute timer that starts when the ota agent transitions to active state. If the job is not completed in that time, i close the ota agent to reset it.

With my image size and 2.5 seconds for the request delay i think my updates are between 4.5 minutes and 5 minutes. I will run a test with extending the request delay to 10s and upping my application timeout to 10 minutes.

PrasadV-AWS wrote on August 19, 2019:

Hi nickm2018 ,

The OTA agent uses QoS 0 for requesting the data blocks and that is the reason we have a timeout to request the same blocks again if not received in OTA_FileRequest timeout. The timer is reset every time a data block is received.

If you have increased the timeout to 16 seconds and still duplicate blocks are getting requested it means device is not receiving a single data blocks in that time frame. You mentioned you have 38 devices , are all devices connected to same access point or suspect any other issue with slow network? Also have you increased the default data block size from 1 KB?

nickm2018 wrote on January 17, 2020:

Just to confirm, i am seeing the duplicate blocks at 16 seconds for the request timer.

PrasadV-AWS wrote on January 18, 2020:

Hello nickm2018,

Thank you for sharing more details regarding the issue. Can you please confirm if you are still using the OTA Library version 1.0.0 ?

We have released features where we can control the number of packets requested from the device and other improvements in last few releases.

nickm2018 wrote on January 17, 2020:

PrasadV-AWS wrote:
Hi nickm2018 ,

The OTA agent uses QoS 0 for requesting the data blocks and that is the reason we have a timeout to request the same blocks again if not received in OTA_FileRequest timeout. The timer is reset every time a data block is received.

@PrasadV-AWS
About 4 months back i modified the library to start at around 8s for the request timer but scale it back every time duplicate blocks are received, to a max of 16s. Our images are too large to default to 16s all the time.
This has been working great until today when all of a sudden 40 to 50% of our devices are getting stuck in a duplicate block loop.
We have about 100 devices across 3 Aruba commercial AP's on a internet connection with no known throughput or latency issues.

Yes, we are using block sizes of 4KB in order to max the block size of our flash.

Hello all,

It’s been a while since this thread was visited but I am currently dealing with a similar issue with duplicate blocks. The issue does not always occur but it seems to be the same behavior as was described in this thread.

Wondering if anything else was done to alleviate the problem besides increasing the value of otaconfigFILE_REQUEST_WAIT_MS?

Did increasing otaconfigFILE_REQUEST_WAIT_MS not solve the problem for you?

It appears to reduce the frequency of the issue occurring but it does not solve it. My application is a bit slow because I am communicating via a cellular modem. I increased otaconfigFILE_REQUEST_WAIT_MS all the way to 30 seconds and only saw the issue occur once out of 20 updates (to a single device, not a fleet) where as it was much more frequent at a lower value.

This is a prototype development so the 30 second value is acceptable for now but it doesn’t seem to be a complete solution unless I need to increase it even further (still in the middle of testing). I feel that the application code needs some additional piece of logic which I need to add to handle the situation.

That isn’t necessarily a question for this thread, I was mainly curious if anything else came from this 3 years ago.

Hi Daniel,
What OTA library are you using for the test?
We’ve patched for duplicated download by PR#428 on OTA repo.
Please help to try.

Thanks.

I had been using v3.0.0 for the OTA library. I have since increased otaconfigFILE_REQUEST_WAIT_MS to 40sec among a few other changes in my application and have not encountered the duplicate block issue in a while. That being said, I will try to update the OTA library and run some more tests.

Generally speaking, is it recommended to stay up to date with new commits on the main branch for this and other official submodules?

Hi Daniel,
You’re using the stable version (tagged) of submodule, and it’s good to know you can pass your tests with it. There’s no problem if you can achieve your goal with it.

Even though we keep updating every library with our best effort, I still suggest you to use stable version, which is well qualified one. In my opinion, if you meet some critical issues and there is no solutions in that stable version, you can try it with main branch.

Thanks.

It is recommended to use a released version in production - Releases · aws/ota-for-aws-iot-embedded-sdk · GitHub