Tasks waiting on queues get stuck

mldevw · October 22, 2020, 4:42am

Hello,

I have an application with several tasks (~12) running at the same time.

There are tasks blocked on a queue that only rarely receives data (on user input).
After a few hours, the queue still fills with data on user input (I checked), but the task stays suspended.

When it works, the debugger shows the task as Suspended, blocking on the queue.
When it does not work, it says the task is blocking on an unknown object.

I have stack overflow detection enabled and checked the high water mark; the statically allocated tasks have plenty of stack space.

Can you please help me to make the system reliable?

I am using FreeRTOS Kernel V10.0.1

rtel · October 22, 2020, 3:51pm

Do you have configASSERT() defined?

mldevw · October 22, 2020, 4:27pm

Hello,

Yes, it is defined, but it is not triggered.
The code is running normally, except that the one thread stays suspended.

I also ported to the most current FreeRTOS version, and it has been running fine since this morning. I hope it will still be good tomorrow.

Edit: Never mind, error is back after reboot.

hs2 · October 22, 2020, 8:03pm

When you say there are tasks blocked on a queue, you mean there is a dedicated queue per task, right? And are a number of tasks getting blocked forever unexpectedly, or just 1 task? If several of those tasks are affected, does it happen one after the other or all at once?
However, it’s pretty strange behavior, and I’m afraid that something in your code corrupts 1 or more queue data structures. Do you dynamically allocate the queues? Then it might be heap corruption, e.g. if the heap is not thread-safe.
Unsafe string functions can also cause headaches (and memory corruption).
But it’s good to be sure that the stacks are large enough.

mldevw · October 23, 2020, 4:29am

Hello,

I have now noticed that multiple tasks can get stuck, each on its own queue.
It is not predictable which task will get stuck or when; they seem to be independent.

I have disabled dynamic allocation and I am not using malloc throughout the application because of safety / reliability concerns.

I will review all the tasks today to check for potential memory corruption…

Please let me know if you have any other ideas.

Update: I reviewed the code for memory corruption but couldn’t find a problem.
I also disabled preemption, but that didn’t help either.
=> Problem persists.

rtel · October 23, 2020, 3:11pm

Is it possible that you post some code that demonstrates the issue?

mldevw · October 28, 2020, 8:35am

Hello, unfortunately the project is quite complex and I’d need to share about 100,000 lines of code. So I am sorry, I cannot share it.

hs2 · October 28, 2020, 8:54am

Maybe it’s possible to post just the relevant code snippets:
the basic user input (ISR?) part feeding the queue(s) and the task’s queue receive loop, along with the queue creation (global variables?).
However, your problem is still strange and hard to tackle.

Is it a custom project (your own) or a generated one (e.g. by STMCube)?
Which MCU/port and compiler are used?
How much main stack (depending on the port, it might be used as the ISR stack) is reserved (usually in the linker script)?
When a task is stuck, is the queue handle (the pointer to the queue structure) still equal to the pointer returned at creation?
When the debugger says unknown object, what does it mean? Is it documented?

mldevw · October 28, 2020, 2:00pm

Hello,

Thank you for your questions.
Here are my answers:

This is a custom PCB with an NXP Kinetis KL82.
Project stack usage is about 93% (is this what you asked for?)
When the error occurs, the queue is still there and gets loaded with data; I can observe that in the debugger.
I am using MCUXpresso. It does not specify what the event object “unknown object” means. When it works, the name of the blocking queue is in that field.

The error occurs not only when an ISR posts data to the queue, but also when another task posts the data.

Here is some code:

1. Definition of queue in header:

public:
static QueueHandle_t interruptQueueHandle;
private:
static StaticQueue_t interruptQueue;
static uint8_t interruptQueueStorageArea[INTERFACE_VEHICLE_INTERRUPT_QUEUE_LEN];

2. Definition of queue in class:

StaticQueue_t InterfaceVehicle::interruptQueue;
uint8_t    InterfaceVehicle::interruptQueueStorageArea[INTERFACE_VEHICLE_INTERRUPT_QUEUE_LEN];
QueueHandle_t InterfaceVehicle::interruptQueueHandle;

2.1 Initialisation of Queue

interruptQueueHandle=xQueueCreateStatic(INTERFACE_VEHICLE_INTERRUPT_QUEUE_LEN,1,interruptQueueStorageArea,&interruptQueue);
vQueueAddToRegistry(interruptQueueHandle, "IV Q Interrupts" );

3. ISR, which is posting data to the queue (I can see this is working, as the number of items in the queue keeps increasing, but they are not read from the queue)

// Gets GPIO interrupts and sends them to queue
void InterfaceVehicle::onInterrupt(GPIO_Type *gpio, uint32_t mask)
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    uint8_t interruptID;

    if ((gpio==PIN_INTERFACEVEHICLE_KL15_STAT_GPIO)&&(mask&(1<<PIN_INTERFACEVEHICLE_KL15_STAT_PIN)))
    {
        interruptID=INT_ID_KL15;
        xQueueSendToBackFromISR(interruptQueueHandle, &interruptID, &xHigherPriorityTaskWoken);
    }

    [...]

    if( xHigherPriorityTaskWoken )
    {
        taskYIELD();
    }
} // end of function onInterrupt

4. Definition of Task reading the queue in header

private:
static void interruptHandler(void* nothing);
static StackType_t taskInterruptHandlerStack[STACK_SIZE_VEHICLE_INTERRUPT_BUFFER];
static StaticTask_t taskInterruptHandlerBuffer;

5. Definition of Task in class

StackType_t InterfaceVehicle::taskInterruptHandlerStack[STACK_SIZE_VEHICLE_INTERRUPT_BUFFER];
StaticTask_t InterfaceVehicle::taskInterruptHandlerBuffer;
TaskHandle_t InterfaceVehicle::taskHandle_interruptHandler;

6. Task Creation

taskHandle_interruptHandler=
			xTaskCreateStatic(interruptHandler,"IV Interrupts", STACK_SIZE_VEHICLE_INTERRUPT_BUFFER,
					NULL, TASK_PRIORITY_INTERFACE_VEHICLE, taskInterruptHandlerStack, &taskInterruptHandlerBuffer);

7. Task implementation

void InterfaceVehicle::interruptHandler(void* nothing)
{
    uint8_t interruptId;
#ifdef STACK_SIZE_PROFILING_ENABLED
    UBaseType_t uxHighWaterMarkCurrent;
#endif
    while(1)
    {
        if (xQueueReceive(interruptQueueHandle,&interruptId,portMAX_DELAY) == pdPASS)
        {
            switch(interruptId)
            {
            case INT_ID_KL15:
                Debug::log(LOGLEVEL_DEBUG,__class__,__func__,"KL15");
                [..]
                break;
            [..]
            }
        }

#ifdef STACK_SIZE_PROFILING_ENABLED
        uxHighWaterMarkCurrent= uxTaskGetStackHighWaterMark(NULL);
        if (uxHighWaterMarkCurrent!=uxHighWaterMarkInterrupt)
        {
            Debug::log(LOGLEVEL_DEBUG,__class__,__func__,"HighWaterMark %d words",uxHighWaterMarkCurrent);
            uxHighWaterMarkInterrupt = uxHighWaterMarkCurrent;
        }
#endif
    }
    vTaskSuspend(NULL);
}

mldevw · October 28, 2020, 3:38pm

Hello, I noticed even tasks which just block on delay are getting stuck. Then the event object “Unknown” appears, where there was nothing before.

rtel · October 28, 2020, 4:21pm

Just focusing on this for the moment. Am I right in interpreting this as if you just create one task that does this:

void my_task( void *pv )
{
volatile int my_var = 0;

    for( ;; )
    {
        vTaskDelay( 10 );
        my_var++;
    }
}

then my_var will never get incremented?

If so, does the task ever get entered? If the task is not entered, then it sounds like the issue is in starting the scheduler, which may be the installation of the SVC exception handler that starts the first task. If the task is entered, and vTaskDelay() gets called but doesn’t return, then it sounds like the tick interrupt is not executing.
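
One way to check whether the tick interrupt is still executing is to count ticks in the tick hook and watch the counter in the debugger once a task is stuck. The following is only a minimal sketch: it assumes configUSE_TICK_HOOK is set to 1 in FreeRTOSConfig.h so that the kernel calls vApplicationTickHook() from the tick interrupt, and the counter name and the helper iTickIsAlive() are hypothetical, not part of FreeRTOS.

```c
#include <stdint.h>

/* Incremented once per tick interrupt. Volatile because it is written in
   interrupt context and read from tasks or the debugger.
   The name is an assumption made for this sketch. */
volatile uint32_t ulTickHookCount = 0;

/* FreeRTOS calls this from the tick interrupt when configUSE_TICK_HOOK is 1. */
void vApplicationTickHook( void )
{
    ulTickHookCount++;
}

/* Hypothetical helper: returns 1 if at least one tick has occurred since the
   snapshot in *pulLastSeen was taken, 0 otherwise, and updates the snapshot. */
int iTickIsAlive( uint32_t *pulLastSeen )
{
    uint32_t ulNow = ulTickHookCount;
    int iAdvanced = ( ulNow != *pulLastSeen );

    *pulLastSeen = ulNow;
    return iAdvanced;
}
```

If ulTickHookCount keeps increasing while a task that called vTaskDelay() never returns, the tick interrupt itself is running and the problem lies elsewhere.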

hs2 · October 28, 2020, 5:03pm

It’s probably not the problem, but instead of calling taskYIELD() in your ISR, better use the appropriate macro:
portYIELD_FROM_ISR( xHigherPriorityTaskWoken );

Since your MCU has a Cortex-M0+ core, there are 2 stack pointers: MSP and PSP (main and process stack pointer). PSP is used for task stacks and MSP is used by startup code and ISRs (see below). The FreeRTOS stack tracing macros only take care of the task stacks, i.e. PSP.
The main stack is usually defined by the linker script (usually a *.ld file), and its start address is the first entry in the vector or exception table, right before the Reset_Handler entry.
It’s not managed by FreeRTOS, and it’s up to application writers to define it appropriately.
In a FreeRTOS application the main stack is used during program startup: Reset_Handler, CRT init, … until the scheduler is started in main(), which in turn starts the first task. Afterwards the main stack is the ISR stack, and its size should be large enough to cover your ISR code.
I guess the traced stack usage of 93% is the maximum task stack usage (derived from the FreeRTOS stack watermarks).

Try to find the linker script file and check/verify the (main) stack related definitions just to be sure.
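
Since the main stack is not covered by the FreeRTOS stack checks, one way to measure it is to fill it with a known pattern early at startup and later count how many words were never overwritten. The following is a minimal sketch of that idea and not part of FreeRTOS: the function names and the fill pattern are hypothetical, and on the target the base address and size would have to come from the symbols in your linker script (their names differ between toolchains). It assumes a stack that grows downward, as on Cortex-M, and that painting covers only the region below the current stack pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary fill pattern, chosen for this sketch. */
#define MAIN_STACK_FILL 0xA5A5A5A5u

/* Fill xWords words starting at the lowest stack address with the pattern.
   On the target this must not touch the part of the stack already in use. */
void vPaintStack( uint32_t *pulLowest, size_t xWords )
{
    size_t i;

    for( i = 0; i < xWords; i++ )
    {
        pulLowest[ i ] = MAIN_STACK_FILL;
    }
}

/* Count the words that still hold the pattern, scanning upward from the
   lowest address. With a downward-growing stack these are the words that
   were never used. */
size_t xUnusedStackWords( const uint32_t *pulLowest, size_t xWords )
{
    size_t xCount = 0;

    while( ( xCount < xWords ) && ( pulLowest[ xCount ] == MAIN_STACK_FILL ) )
    {
        xCount++;
    }

    return xCount;
}
```

A result close to zero after the system has run for a while would suggest that the main (ISR) stack is too small.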

mldevw · October 28, 2020, 6:07pm

Hello,

In this setup the variable would increment up to some point and then stay at the same value.

Meanwhile, some other tasks continue executing as expected, and some tasks also get stuck (although they are not linked in any way).

How could I check if there is a problem with the tick interrupt?
If it was not working at all I guess all of the tasks would be stuck, right?

mldevw · October 28, 2020, 6:24pm

Hello, thanks for the feedback.
I am now using portYIELD_FROM_ISR throughout the program.

I have not found out how to check the linker script, so I uploaded it there.

I have not manually changed anything there. I also have not defined the main stack; my IDE probably did so while creating the project.
Is the configuration alright?

The 93% is the stack usage calculated by the compiler.
As I am only using static allocation and nothing dynamic, this should be the maximum stack usage and thus fine, I guess?

hs2 · October 28, 2020, 6:46pm

0x1000 == 4KB of main stack is usually more than enough. So this shouldn’t cause any problems, even though only you know the ISR code. But I guess it follows the recommended design of doing little in ISRs (handle the required HW control and notify a task) and deferring most work to the notified task.
I don’t know how good compiler stack usage calculation is nowadays. But (almost) for sure it’s just an estimate. So 93% might be fine and might never be reached at runtime. If it’s underestimated, you’ll run into trouble. But since you’ve enabled FreeRTOS stack checking and configASSERT, you would most likely detect task stack overflows.
Sorry, it’s getting even more strange what’s going wrong …
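
To compare the compiler’s estimate with what actually happens at run time, the high water mark of a task can be converted into a usage percentage. This is only a sketch with a hypothetical helper name: it assumes the total stack size and the value returned by uxTaskGetStackHighWaterMark() (the smallest amount of free stack seen so far) are both given in words.

```c
#include <stdint.h>

/* Hypothetical helper: percentage of a task stack that has been used at
   some point, given its total size and its high water mark (minimum free
   space ever observed), both in words. Inconsistent input is reported as
   100 so that it is not mistaken for a healthy stack. */
uint32_t ulStackUsagePercent( uint32_t ulTotalWords, uint32_t ulHighWaterWords )
{
    if( ( ulTotalWords == 0u ) || ( ulHighWaterWords > ulTotalWords ) )
    {
        return 100u;
    }

    /* Fine for MCU-sized stacks; very large word counts would overflow
       the 32-bit multiplication. */
    return ( ( ulTotalWords - ulHighWaterWords ) * 100u ) / ulTotalWords;
}
```

For example, a 200-word stack with a high water mark of 14 words has been 93% used at its peak.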

mldevw · October 29, 2020, 3:09pm

Hello, I found out that the error disappeared when I disabled the freertos_i2c driver and switched to the normal i2c driver (protected by mutexes).

I still wonder how this corrupts tasks which don’t even make calls to the library…
Anyway, I now seem to have a starting point for tracing my problems…

mldevw · October 29, 2020, 7:16pm

I am back again… This was not the problem. Tasks still get stuck.

rtel · October 29, 2020, 8:40pm

Have you tried any kernel aware plug-ins or trace tools to give you better insight into the issue?

mldevw · October 30, 2020, 5:39am

Yes. This is how I found out that the “Event Object” is the queue being blocked on, or nothing (for tasks with a delay), and in the error case it is “Unknown Object” with an invalid address.

rtel · October 30, 2020, 6:56pm

Which plug-in is this? Asking as sometimes (in the WITTENSTEIN state viewer plug-in, for example) it will only know what the object is if the object was explicitly named by calling an API function. Otherwise, if the object or its name was known before but is not when the error occurs, it could be that the object is genuinely corrupted.