vTaskDelay stops scheduler without resuming it after a few calls

Hi,

I’m working on OTA over cellular on an STM32. The first goal is to get the MQTT Agent running, and I’m nearly there. When I run this loop, it runs for a few cycles before it hangs on a configASSERT:

    	while(1){
    		//static OtaMqttStatus_t prvMQTTPublish(const char* const pacTopic, uint16_t topicLen, const char* pMsg, uint32_t msgSize, uint8_t qos);
    		sprintf(messagestring,"Hello World!...Iteration=[%li]",i);
    		prvMQTTPublish(topicstring, strlen(topicstring), messagestring, strlen(messagestring), 1);
    		vTaskDelay(5000);
    		i++;
    	}

The configASSERT in question is configASSERT( uxSchedulerSuspended == 0 );

vTaskDelay() calls vTaskSuspendAll() and normally calls xTaskResumeAll() afterwards. This works for a few cycles of my loop, but then for some reason xTaskResumeAll() is not called, which triggers the assert the next time vTaskDelay() is called.

I’m quite lost as to why this occurs, does anyone have a hunch?

For reference, here’s where this happens in tasks.c:

void vTaskDelay( const TickType_t xTicksToDelay )
{
BaseType_t xAlreadyYielded = pdFALSE;

	/* A delay time of zero just forces a reschedule. */
	if( xTicksToDelay > ( TickType_t ) 0U )
	{
		configASSERT( uxSchedulerSuspended == 0 );
		vTaskSuspendAll();
		{
			traceTASK_DELAY();

			/* A task that is removed from the event list while the
			scheduler is suspended will not get placed in the ready
			list or removed from the blocked list until the scheduler
			is resumed.

			This task cannot be in an event list as it is the currently
			executing task. */
			prvAddCurrentTaskToDelayedList( xTicksToDelay, pdFALSE );
		}
		xAlreadyYielded = xTaskResumeAll();
	}
	else
	{
		mtCOVERAGE_TEST_MARKER();
	}

	/* Force a reschedule if xTaskResumeAll has not already done so, we may
	have put ourselves to sleep. */
	if( xAlreadyYielded == pdFALSE )
	{
		portYIELD_WITHIN_API();
	}
	else
	{
		mtCOVERAGE_TEST_MARKER();
	}
}
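As a reading aid for the listing above: vTaskSuspendAll() increments uxSchedulerSuspended and xTaskResumeAll() decrements it, so the assert at the top of vTaskDelay() fires whenever the counter is non-zero on entry. That can happen after an unmatched suspend, but also if something else writes to the variable. Below is a minimal host-side sketch of that bookkeeping. The model_* names are hypothetical and this is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's uxSchedulerSuspended counter. */
static unsigned int suspend_depth = 0;

/* Models vTaskSuspendAll(): each call nests one level deeper. */
static void model_suspend_all(void)
{
    suspend_depth++;
}

/* Models xTaskResumeAll(): undoes exactly one earlier suspend. */
static void model_resume_all(void)
{
    assert(suspend_depth > 0);
    suspend_depth--;
}

/* Models one vTaskDelay() call. Returns false in the situation where
 * configASSERT( uxSchedulerSuspended == 0 ) would fire. */
static bool model_task_delay(void)
{
    if (suspend_depth != 0) {
        return false;
    }
    model_suspend_all();
    /* ... the real kernel moves the task to the delayed list here ... */
    model_resume_all();
    return true;
}
```

In this model a balanced suspend/resume pair always leaves the counter at zero, so a failing check on entry means the counter was changed outside the pair.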

First thing to check would be the ‘usual suspects’ - things like the interrupt priorities and stack overflows. This page provides pointers to the relevant information on things like configASSERT() and configCHECK_FOR_STACK_OVERFLOW https://freertos.org/FAQHelp.html

Which version of FreeRTOS are you using? The newer the version, the more configASSERT() calls there are to catch these things.

I have implemented the vApplicationMallocFailedHook() as well as the vApplicationStackOverflowHook() and these do not get triggered.

There is only one interrupt in my own code that can occur during this, which is when the cellular modem sends a command. My code needs a UART receive-idle interrupt for this, which fires after a serial command line has been received from the modem.

Could this interrupt be messing with vTaskDelay(); ?

I’m very new to FreeRTOS so I’m probably overlooking some beginner’s mistake.

I’m running FreeRTOS 10.3.1, generated by STM32 CubeMX

Sorry. Deleted stupid question as the answer is in your first post.

In which case we will need a little more information. If you are sending publish messages I assume you have already connected to the MQTT broker. Did you use one of the examples provided in either of the FreeRTOS or AWS git repos as a starting point? Or maybe STM32Cube as the starting point? Is that where the implementation of prvMQTTPublish() comes from?

What I did was I took the OTA demo, intended for running in a windows simulator, and ported that to my STM32 by changing the transport interface and PAL interface. I wrote a custom driver for the cellular modem, which is the glue between the modem AT commands and the socket needed for the transport layer.

The prvMQTTPublish function works. I can see the messages coming in inside my AWS console. But it only works about 5 times, then the scheduler stops because vTaskDelay suspends it for some reason.

Try publishing a constant string each time, rather than creating a new string each time. That would remove the possibility of the sprintf() function causing an issue (implementations can do unexpected things), and potentially the buffer being accessed from more than one thread simultaneously (just looking for something that could cause a data corruption). So something like:

static const char * const StringToSend = "Send me to the clouds";

So it’s not on the stack either.

If that doesn’t move you forward please attach your FreeRTOSConfig.h file here (or send it to me at r dot barry at freertos dot org if you don’t want to attach).

Ok, I tried your suggestion to no avail unfortunately. After a few publishes the scheduler is suspended again. Attached is my FreeRTOSConfig.h

FreeRTOSConfig.h (6.4 KB)

As a further observation, the unexpected suspension of the scheduler seems to be happening just after data is received from the cellular modem. I’m running a state machine for handling the modem in a separate task. The state machine is driven by UART interrupts from the modem which can change the state. The interrupt routine for this is given below. As data reception seems to trigger the problem, I am possibly doing something wrong here, I’m just not seeing it.

void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size)
{

	if(huart->Instance == USART3){ //data received from SimCom module
		received_length=Size;
		if(DEBUGOUT==1)HAL_UART_Transmit(&huart1, (uint8_t *) &RxBuf[0], Size, -1); //Echo reception to debugport
		repbuf[29]=0;//null termination for string
		memcpy(repbuf, RxBuf, 25);


		if(strstr((char*)&repbuf, "OK") != NULL) {
			ok_flag=1;
		}
		if(strstr((char*)&repbuf, (char*)&repstr) != NULL) {
			reply_confirmed_flag=1;
			memcpy(MainBuf, RxBuf, Size);
		}

		if(strstr((char*)&repbuf, "+CREG: 1") != NULL){
			simcomstate=cellular_connected;
		}
		if(strstr((char*)&repbuf, "+CREG: 2") != NULL){
			simcomstate=cellular_connecting;
		}
		if(strstr((char*)&repbuf, "+CREG: 0") != NULL){
			simcomstate=cellular_connecting;
		}

		if(strstr((char*)&repbuf, "+CPIN: NOT INSERTED") != NULL){ //no simcard detected
			simcomstate=error;
		}

		if(strstr((char*)&repbuf, "+CASTATE: 0,0") != NULL){ //TCP connection is closed by remote server or internal simcom error
			simcomstate=tcp_connection_lost;

		}

		if(strstr((char*)&repbuf, "+CADATAIND") != NULL){
			simcomstate=tcp_data_received;
		}
		if(strstr((char*)&repbuf, "+CASTATE: 0,0") != NULL){ //register the disconnect
			tcp_connection_flag=0;
		}

		if(strstr((char*)&RxBuf, "ERROR") != NULL) error_flag=1;

		reply_flag=1;
		data_received=1;

		HAL_UARTEx_ReceiveToIdle_DMA(&huart3, RxBuf, RxBuf_SIZE);
	}
	if(huart->Instance == USART1){ //data received from serial debug interface

		if(strstr((char*)&RxBuf_debugport, "TEST") != NULL){
			simcomstate=teststate;
		}
		else HAL_UART_Transmit(&huart3, (uint8_t *) &RxBuf_debugport[0], Size, -1); //forward message to SimCom module

		HAL_UARTEx_ReceiveToIdle_DMA(&huart1, RxBuf_debugport, RxBuf_SIZE);
	}

}
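One thing visible in the callback above is that memcpy(repbuf, RxBuf, 25) always copies 25 bytes regardless of Size, so bytes left over from an earlier, longer reception can end up in the buffer that strstr() searches. The search on RxBuf also relies on a terminator being present in the DMA buffer. A minimal sketch of a bounded, always-terminated copy follows. The helper name copy_terminated is hypothetical, the buffer sizes are not known from the thread, and nothing here is established as the cause of the hang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies at most dst_cap - 1 bytes of the received data into dst and always
 * null-terminates dst, so string functions such as strstr() only see bytes
 * from the current reception. Returns the number of bytes copied. */
static size_t copy_terminated(char *dst, size_t dst_cap,
                              const unsigned char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);
    if (dst_cap == 0) {
        return 0; /* no room even for the terminator */
    }
    size_t n = (src_len < dst_cap - 1) ? src_len : dst_cap - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

In the callback this would be called with the received Size as src_len and the real size of the destination buffer as dst_cap, for example copy_terminated((char *)repbuf, sizeof repbuf, RxBuf, Size), assuming repbuf is an array whose size is visible at that point.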

Amending this post as I am working on it. Another observation: when I do not call MQTTPublish and just wait in a loop, the MQTTAgent sends an MQTT ping once a minute to keep the connection alive. This goes well for a couple of minutes, after which the scheduler gets suspended and the application hangs. So whatever is causing this, it is not in my own code that does the cyclic publish.

Spent yet another day chasing this bug without success. An important thing to note is that when I set the publish QoS to 0, I’ve been able to publish at 1 Hz for 1 hour without errors. When QoS is 1, the error is still there: it works a few times, then the scheduler is suspended. So there must be something wrong in receiving.
I have not tried subscribing and receiving yet, but that will very likely also not work…
These are the kind of things that consume so much time, I have to suck it up and keep grinding… Hope you guys here can help out, I’m stumbling in the dark at the moment.

What is the priority of the UART Rx interrupt? Also, have you changed the HAL Tick Base using STM32CubeIDE?

For the tick base I chose an available unused timer. The priority for the UART Rx DMA is level 5.

That seems correct. We need to narrow down the problem. Can you try removing the debug port echo from your ISR so that we do not need to initiate a UART Tx? Also, can you share the call stack (possibly a debugger snapshot) when the assert fires?

Thanks.

Ok, I removed the UART Tx, problem persists. Attached is a screenshot of the general registers when the assert is triggered. I’m assuming this is what you meant?

Seems like a memory corruption. What is the value of uxSchedulerSuspended at the time of assert?

At the time of assert:

Name : uxSchedulerSuspended
Details:134285590
Default:134285590
Decimal:134285590
Hex:0x8010916
Binary:1000000000010000100100010110
Octal:01000204426

Another attempt to run yielded:

Name : uxSchedulerSuspended
Details:134285616
Default:134285616
Decimal:134285616
Hex:0x8010930
Binary:1000000000010000100100110000
Octal:01000204460
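Both readings, 0x08010916 and 0x08010930, are far too large to be a suspend nesting count. On STM32 parts the main flash is normally mapped at 0x08000000, so the values look more like code or constant-data addresses than counts. Here is a small sketch of that sanity check, assuming a hypothetical 1 MiB flash size (the actual part is not stated in the thread) and a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLASH_BASE_ADDR    0x08000000u /* usual STM32 main flash base */
#define ASSUMED_FLASH_SIZE 0x00100000u /* hypothetical: 1 MiB part */

static_assert(ASSUMED_FLASH_SIZE != 0u, "flash size must be non-zero");

/* Returns true if value falls inside the assumed flash address window. */
static bool looks_like_flash_address(uint32_t value)
{
    return value >= FLASH_BASE_ADDR &&
           (value - FLASH_BASE_ADDR) < ASSUMED_FLASH_SIZE;
}
```

A legitimate nesting count such as 0 or 1 fails this check, while both values reported by the debugger pass it under the stated assumption.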

This does not look correct and looks like the memory is corrupted. Let’s try the following:

  1. Define a new variable right after the definition of uxSchedulerSuspended:
PRIVILEGED_DATA static volatile UBaseType_t uxSchedulerSuspended_Test = ( UBaseType_t ) pdFALSE;
  2. Start the debugger and put a breakpoint in your application task (StartDefaultTask).
  3. Put a data breakpoint at the location of your new variable uxSchedulerSuspended_Test.
  4. Remove the breakpoint created in step 2 and let the debugger continue.

The expectation is that the linker places uxSchedulerSuspended_Test and uxSchedulerSuspended close enough together that any memory corruption which corrupts uxSchedulerSuspended will also corrupt uxSchedulerSuspended_Test. Since we have a data breakpoint on uxSchedulerSuspended_Test, the debugger will break as soon as it is changed (which should not happen in a legitimate case).

Thanks.

The debugger does not hit the uxSchedulerSuspended_Test breakpoint.

However, it initially hits a new assert, configASSERT( puxStackBuffer != NULL ) in xTaskCreateStatic. When I click resume, the code keeps running, and works for a couple of publish iterations before it hangs on the scheduler suspended assert again.

Are you passing NULL for the stack buffer? Can you share your task creation code?

If I comment out
PRIVILEGED_DATA static volatile UBaseType_t uxSchedulerSuspended_Test = ( UBaseType_t ) pdFALSE;

Then the configASSERT(puxStackBuffer != NULL) does not fire

Here is my main.c which runs the AWS vStartOtaDemo after the socket is running. I have modified vOtaDemoTask to not start the OTA yet, just the MQTTAgent for testing.

I’ve been placing printfs in a lot of places and looking at the log. After the Agent posts a publish, it does not call the transport interface anymore to receive the QoS=1 Ack. So it seems the problem occurs inside the MQTTAgent itself and not in my implementation of the transport interface.

main.c (12.8 KB)
OtaOverMqttDemoExample.c (68.9 KB)