Cellular OTA up and running, but way too slow

After a lot of debugging I finally managed to get OTA working over a cellular connection. However the OTA update runs extremely slow, taking over an hour to update an application which is just roughly 200kb.

My workflow was that I ported the window demo to my platform, changed the transport layer to that of the cellular library and implemented the PAL layer.

Analyzing the log, two recurring log items pop up:

[INFO] [Sockets] [sockets_wrapper.c:322] prvNetworkRecv timeout

and

[ERROR] [OTADemo] [OtaOverMqttDemoExample.c:938] Error: No OTA data buffers available.

Eventually the update is successful, but practically, with intermittent cellular connectivity, this is too slow to work.

I have attached a part of the log. I’m hoping someone here can point out what the possible problem could be. It is hard for me to debug as it is sort-of working, and I do not understand the innerworkings of the OTA agent fully.

I initially thought the OTA demo was running slow due to having a low RTOS priority, so I changed it to max priority but it is still running super slow.

OTA_Log_Slow.zip (3.3 KB)

Going through the error logs:

prvNetworkRecv timeout comes from a socket wrapper file in your project. Could you provide little more details as to what the function does and from where the function is called?

Are you using a TCP/IP stack offloaded to the cellular module ? If so is it possible to set the socket receive to non blocking. Our reference example sets the TCP socket to non blocking after a TLS connection is established as shown here. This will allow MQTT process loop to not block for a data from socket, if there is no data available. This have shown considerable improvement in OTA download speeds as it will reduce the block time between each OTA windows.

Regarding the error: Error: No OTA data buffers available. , It seems number of OTA buffers allocated to handle incoming packets is not enough. The configuration value is defined as otaconfigMAX_NUM_OTA_DATA_BUFFERS in file ota_config.h and this should be set to at-least one greater than otaconfigMAX_NUM_BLOCKS_REQUEST defined in same config file.

Thank you so much! That was the solution!
In the sockets_wrapper.c of the cellular library, the default is
#define CELLULAR_SOCKET_RECV_TIMEOUT_MS ( 1000UL )
so I changed it to zero.

For the OTA buffer, there was no definition of otaconfigMAX_NUM_OTA_DATA_BUFFERS, in my ota_config.h file, which loaded the values from ota_config_defaults.h, were the default value was only 1.

The update now takes just 2 minutes.

Really appreciate the great help here on the forum!

I see in my latest logging that despite setting CELLULAR_SOCKET_RECV_TIMEOUT_MS to zero, I’m still getting very frequent timeouts, the log entries are:

[INFO] [Sockets] [sockets_wrapper.c:322] prvNetworkRecv timeout

I think if this gets resolved my OTA process will be much faster. Does anyone know how to solve this?

I’ve added the code that sets the TCP socket to non blocking after TLS has been established like @ravibhagavandas recommended in the comment above. It is the addition of:

if( xConnected )
{
    ( void ) FreeRTOS_setsockopt( pxNetworkContext->pParams->tcpSocket,
                                  0,
                                  FREERTOS_SO_WAKEUP_CALLBACK,
                                  ( void * ) prvMQTTClientSocketWakeupCallback,
                                  sizeof( &( prvMQTTClientSocketWakeupCallback ) ) );

    ( void ) FreeRTOS_setsockopt( pxNetworkContext->pParams->tcpSocket,
                                  0,
                                  FREERTOS_SO_RCVTIMEO,
                                  &xTransportTimeout,
                                  sizeof( TickType_t ) );
}

However, when I add that piece of code, everyting builds ok, but I get a stack overflow of the OTA Demo task. So I increased the stacked size, and even increased it by a factor of 10 and am still getting an overflow. I do not understand why the addition of the code above requires so much extra stack.
Ideas anyone?

Is this is the task doing TLS handshake? If yes, that may be the reason of high stack requirement. What is the stack size you are using?

The parent task (OTA Demo Task) that overflows upon adding the code snippet only spawns two new child tasks, the MQTT agent and the OTA task. The TLS setup is performed in the MQTT agent child task. Both these child tasks have their own stack and work just fine. The parent works just fine without the code snippet with a stack size of 1024, but with the code snippet added to the MQTT agent child task, even when the stack for the parent is increased to a whopping 48000, the stack overflow hook gets triggered and points to the parent task as the origin of the overflow.

That’s s a huge stack and I think it shouldn’t be needed.
So either a bug corrupting the stack memory by overflowing/overwriting a local buffer variable or a recursion problem could make the task go wild.
Can you put a breakpoint on prvMQTTClientSocketWakeupCallback and track it ?

I tried tracing down the problem, but it is illusive. I do know that it happens only due to this piece of the code snippet:

	( void ) FreeRTOS_setsockopt( pxNetworkContext->pParams->tcpSocket,
                                  0,
                                  FREERTOS_SO_WAKEUP_CALLBACK,
                                  ( void * ) prvMQTTClientSocketWakeupCallback,
                                  sizeof( &( prvMQTTClientSocketWakeupCallback ) ) );

When debugging, then after the FreeRTOS_setsockopt returns, I get the stack overflow.

The only difference I had to make to this code section is I changed to pxNetworkContext->pParams->tcpSocket compared to the example code pxNetworkContext->tcpSocket. I had to do this because in my code pxNetworkContext had no member named tcpSocket.

Can it be that this is not allowed?

Which task does the stack overflow occur in? The task calling setsockopt(), or the task that calls prvMQTTClientSocketWakeupCallback()?

If you put a breakpoint in prvMQTTClientSocketWakeupCallback(), does it ever get called? If so, step through the function to see what is causing the issue. If not, then look where the function gets called from to see if there is an issue dereferencing something set during the setup.

The overflow occurs in otaDemoTask. That task starts 2 new tasks, prvOTAAgentTask and prvMQTTAgentTask.
The prvMQTTAgent task calls prvSocktConnect in which the following line triggers the overflow:
( void ) FreeRTOS_setsockopt( pxNetworkContext->pParams->tcpSocket,
0,
FREERTOS_SO_WAKEUP_CALLBACK,
( void * ) prvMQTTClientSocketWakeupCallback,
sizeof( &( prvMQTTClientSocketWakeupCallback ) ) );

The prvMQTTClientSocketWakeupCallback never gets called.

After FreeRTOS_setsockopt returns, I can see that pxNetworkContext->pParams->tcpSocket->pxUserWakeCallback was overwritten with the pointer value of prvMQTTClientSocketWakeupCallback, so that seems correct. However when I then take one step further in the debugger I end up in the stackoverflow hook.

Completely stuck here, any help or tips is greatly appreciated.

From your post, it seems that you are using the cellular socket, through this wrapper file: sockets_wrapper.h . Is that correct?

The code posted above (FreeRTOS_setsockopt) set’s socket options for a FreeRTOS+TCP connection. This is expected if the FreeRTOS+TCP is used as TCP/IP stack on board.

So, could you clarify few things:

  1. Is TCP/IP stack running on the main MCU board or on the cellular modem ? If it’s on MCU, are you using FreeRTOS+TCP?
  2. Are you using one of the Cellular modules already ported to the cellular interface here ?

Hi Ravishankar,

Thank you for your help

  1. I’m using the TCP/IP stack of the modem, so calling FreeRTOS_setsockopt was the wrong thing to do. Now I’m looking into the sockets wrapper (attached). It already uses Cellular_SocketSetSockOpt to set the receive timeout to zero. I then also wanted to setup the prvMQTTClientSocketWakeupCallback as you suggested, but I can’t as the cellular socket does not have a wakeup callback option. How can I approach this?
  2. Yes I’m using the cellular module that is already ported, the sara-r4.

Just restating my main problem; I am seeing a lot of these log entries:
[INFO] [Sockets] [sockets_wrapper.c:322] prvNetworkRecv timeout

and my cellular OTA seems to be slow, it takes 2 minutes to complete, and I think the receive timeouts are slowing it down. I do not know why the timeouts occur, as the sockets_wrapper already sets the CELLULAR_SOCKET_OPTION_RECV_TIMEOUT to zero (in line 667, see attachment).

sockets_wrapper.c (36.1 KB)

The call sequence is like this –

  1. When the socket is created, the recv timeout is set to 0 and data ready callback is registered.
  2. Sockets_Recv calls prvNetworkRecvCellular which in-turn calls Cellular_SocketRecv and expects the prvCellularSocketDataReadyCallback to be invoked.
  3. If the callback is not invoked (probably because there is no data on the socket), you see the timeout message.

So, this timeout message just means that the no data was received when Sockets_Recv was called. I am not sure we can conclude that it is the reason of the slowdown. Have you tried profiling and see which part is bottleneck?

I do not have a profiling tool chain integrated into my code, and that is quite some work to set up it seems. Even if I can set that up, the times it will give me won’t tell me anything as I do not have a reference.
I reasoned that getting frequent timeout messages is not a good thing, and I have no reference of how fast a normal cellular OTA job should be. For me, the firmware is 400kb and it takes about 2 minutes to update, is that a normal time?
If that is normal and getting a lot of socket receive timeout log entries is also normal, than I do not have a problem, I will then just comment out the socket receive timeout log line for better readability of my log.
I’m not sure if I need to take further action at this point, what do you recommend? leave it as-is and disable the log entry, or continue to search to possibly shorten the 2 minutes OTA time?
I do not know if my code is working as it should, or somehow running in a suboptimal way…

Hi Mike,

Sorry for late reply. The log you mentioned indicates that the modem hasn’t received any data from network to this socket within CELLULAR_SOCKET_RECV_TIMEOUT_MS. From your description, it could be slow network speed since the data is still received after retry by upper layer software. You can just ignore the log.

I use SARA-R4 AT command for explanation. SARA-R4 sends “+UUSORD” URC to indicate data received from network to this socket. Once the cellular library received the “+UUSORD” URC, it will send the “AT+USORD” command to retrieve the data in modem.

# Cellular library received socket data received URC
+UUSORD: 1,10

# Cellular library send read socket data command
AT+USORD=1,10

+USORD: 1,10,"1234567890"

If “+UUSORD” is not received within CELLULAR_SOCKET_RECV_TIMEOUT_MS, the following log will be printed to inform user that the modem hasn’t received any data. User can still retry waiting for the URC by calling Sockets_Recv again.

 [INFO] [Sockets] [sockets_wrapper.c:322] prvNetworkRecv timeout

There are many factors that can affect the data transfer speed like baud rate, slow network speed, cpu processing time…

If you are using baud-rate 115200 to communicate with SARA-R4, the upper limit to receive 400KB is 27.78 seconds. Increase the baud-rate may improve the data transfer speed. However, the maximum download speed could still be limited by other factors, for example the network speed.

The info log reduces user readability. We could try to lower the log level to debug. Thank you for your experience sharing.

Thank you for the clarification! I will disable the socket receive timeout and take no further action.