Task stuck at taskNOTIFICATION_RECEIVED

pbleyer wrote on Monday, August 01, 2016:

Hello.

I am experiencing lost task notifications in a communication module under heavy loads. The code affected is essentially a message transmission task that sends data to a transceiver. It gets notified either from a start transmission function or an ISR that sends a new notification when the physical transmission buffers become available.

The transmission task procedure sits in an infinite loop waiting for events:

void
optoProc(void *arg)
{
    while (true)
    {
        uint32_t e = ulTaskNotifyTake(pdTRUE, portMAX_DELAY);

        if (e & opto_Tx)
        {
            // Get message from txQueue and encode bytes for transmission
            optoTxProc(...);
        }
    }
}

The ISR can wake up the task using the following code snippet:

if (buffer_available)
{
    BaseType_t pw;
    xTaskNotifyFromISR(optoTask, opto_Tx, eSetBits, &pw);
    portYIELD_FROM_ISR(pw);
}

And the message send function does basically the same after putting the message in a queue:

int
optoTxPut(OptoDevice *d, const OptoMessage *m)
{
    if (xQueueSend(d->txQueue, m, 0) != pdTRUE)
        return opto_EOverflow;

    xTaskNotify(optoTask, opto_Tx, eSetBits);
    return opto_Ok;
}

So everything is pretty straightforward. The interrupt priorities are configured properly and system/memory checking hooks are in place to trap any issues such as memory leaks or corruption.

When running the system under heavy packet traffic, I started noticing that the optoTxPut function was returning queue overflow errors even when hardware transceiver buffers were available, and transmission stopped. I realized that under that condition the optoProc procedure is stuck waiting for notifications, even though the xTaskNotifyFromISR function is being called and attempting to send notifications.

I traced into the underlying xTaskGenericNotifyFromISR function and see that the decision to deliver the notification and wake up the task is made there, when the task's ucNotifyState changes from taskWAITING_NOTIFICATION to taskNOTIFICATION_RECEIVED. However, when the issue happens, the task is already in the taskNOTIFICATION_RECEIVED state and is therefore never put on the ready list. I can manually recover the system with the debugger by modifying the ucNotifyState variable and triggering the notification actions.

I see that ucNotifyState only gets changed to taskNOTIFICATION_RECEIVED in the task notify functions (xTaskGenericNotify, xTaskGenericNotifyFromISR, vTaskNotifyGiveFromISR). Is there any other behavior I am missing that can affect the notification value or remove the task from the ready list, so that the notification gets lost or the task is not run (and that would explain why ucNotifyState does not revert)? I debugged the interrupt and didn't see any improper interaction between it and the send function. In the meantime my obvious workaround is to use a time delay instead of waiting forever, to periodically force the task procedure to run and check the queues, but I would certainly like to find the root of the problem.
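Roughly, that workaround looks like this (the 10 ms period is arbitrary and only for illustration):

for (;;)
{
    /* Wake at least every 10 ms even if a notification is missed */
    (void) ulTaskNotifyTake(pdTRUE, pdMS_TO_TICKS(10));

    /* Check and drain txQueue regardless of why we woke up,
       i.e. run the optoTxProc() step from the loop above */
}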

Thanks for any ideas.

davedoors wrote on Monday, August 01, 2016:

Using pdTRUE as the first parameter to ulTaskNotifyTake() will clear the notification count to zero, and as the code is posted you only process one notification for each call to ulTaskNotifyTake(). That means you can lose events. See the code examples on the link below, which show two ways you can fix that.

http://www.freertos.org/RTOS_Task_Notification_As_Counting_Semaphore.html
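For reference, a minimal sketch of one of the two styles shown on that page (process_one_event() is a placeholder, not code from this thread):

for (;;)
{
    /* pdFALSE: decrement the count by one instead of clearing it,
       so every pending notification gets its own pass */
    if (ulTaskNotifyTake(pdFALSE, portMAX_DELAY) > 0)
    {
        process_one_event();
    }
}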

pbleyer wrote on Monday, August 01, 2016:

Hi Dave. Thanks for replying.

Processing one notification is fine, since the optoTxProc function will exhaust the queues. It does not matter how many notification events were received. The task should wake up when the send function is called again or the interrupt realizes there are available buffers (which is what usually happens) and will send a notification. This will happen when optoTxProc has finished the byte encoding and the task is indeed waiting for a notification.
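In other words, the drain loop is roughly the following (encodeAndSend() and the exact signature are placeholders, not the real code):

static void optoTxProc(OptoDevice *d)
{
    OptoMessage m;

    /* Keep pulling until the queue is empty, so a single notification
       covers any number of queued messages */
    while (xQueueReceive(d->txQueue, &m, 0) == pdTRUE)
    {
        encodeAndSend(&m);   /* placeholder for the byte encoding */
    }
}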

I have played with the parameters and used the counting behavior, but all variants show the same issue. ucNotifyState eventually gets stuck in the taskNOTIFICATION_RECEIVED state, so I suspect that on the first state change the RTOS call placed the task in the ready list, just as when the system runs fine. But then something happens in between, and ucNotifyState either never reverts to taskWAITING_NOTIFICATION, or it does revert and the task is never put in the ready list to be executed after the variable is changed back to taskNOTIFICATION_RECEIVED.

I manually changed the FreeRTOS code to update the task lists regardless of the value of ucNotifyState, and as expected that "fixes" the problem. I haven't been able to pin down the exact condition under which the issue occurs, since most of the time the system runs as expected. I am trying to isolate whether the issue is interrupt or task related.

rtel wrote on Tuesday, August 02, 2016:

I've been trying to get my head around exactly what it is you are describing here. It is hard to see how it could get into this state.

First, probably not relevant to your problem, but inside the ISR pw should be initialised to 0 (pdFALSE) before it is passed to xTaskNotifyFromISR().
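That is, the snippet above should read something like:

if (buffer_available)
{
    BaseType_t pw = pdFALSE;   /* must start as pdFALSE */

    xTaskNotifyFromISR(optoTask, opto_Tx, eSetBits, &pw);
    portYIELD_FROM_ISR(pw);
}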

Question: Why are you using the eSetBits action, to set bits in the receiving task's notification value, rather than just using eIncrement to increment the receiving task's notification value? Does the receiving task also check for other bits that you have not shown in your post?
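For example, if the extra bits are not needed, the ISR could simply do something like this (a sketch only, paired with ulTaskNotifyTake() in the task):

BaseType_t pw = pdFALSE;

/* Equivalent to xTaskNotifyFromISR() with the eIncrement action */
vTaskNotifyGiveFromISR(optoTask, &pw);
portYIELD_FROM_ISR(pw);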

The interrupt priorities are configured properly and system/memory checking hooks are in place to trap any issues such as memory leaks or corruption.

Are you using a Cortex-M processor? Is it an M0(+), M3, M4 or M7? Do you have configASSERT() defined?

Which version of FreeRTOS are you using? (taskWAITING_NOTIFICATION isn't used in the latest version).

pbleyer wrote on Saturday, August 06, 2016:

Hi rtel. Thank you for replying.

  • You are right about the pw variable. By chance it happens to be initialized to 0 anyway, but the explicit initializer was missing. The problem is not related to that, nor to whether portYIELD_FROM_ISR is called based on the value of pw.
  • Normally, the task also checks for other bits, but for the purpose of debugging, to isolate the issue and to post here, I removed all the actions based on the other bits. From looking into the code I understand that receiving multiple notifications that set the same bit should be no problem, and on the first change from the "waiting" state to the "notified" state the task will be scheduled for execution, which I assume is what happens most of the time. From what I can tell, as I explained above, the variable changed to "notified" and either the task was never scheduled, it was scheduled and for some reason it didn't run, or it ran but the value of ucNotifyState was not properly restored. I can get into the same situation if I use the counting API, but I am not using that because of the nature of how the optical transceiver is served and the way the hardware buffers become available.
  • This is indeed a Cortex-M4, configASSERT is defined, and portASSERT_IF_INTERRUPT_PRIORITY_INVALID does not assert because priorities are properly configured. I also demoted the interrupt to the same priority as the RTOS to test whether there were any interrupt conflicts, but everything seems to be configured correctly in terms of interrupt priorities and the order of function calls. Interrupt priority was the first thing I suspected as the cause of the issue, and I checked it down to the register configuration of the NVIC.
  • I am using the trunk from SF. I thought this repo was the official one and up to date. I see the taskWAITING_NOTIFICATION macro defined and used in the tasks.c module.
    URL: svn://svn.code.sf.net/p/freertos/code/trunk
    Relative URL: ^/trunk
    Repository Root: svn://svn.code.sf.net/p/freertos/code
    Repository UUID: 1d2547de-c912-0410-9cb9-b8ca96c0e9e2
    Revision: 2466
    Node Kind: directory
    Schedule: normal
    Last Changed Author: rtel
    Last Changed Rev: 2466
    Last Changed Date: 2016-06-28 06:39:25 -0400 (Tue, 28 Jun 2016)

Best.

rtel wrote on Saturday, August 06, 2016:

By chance it happens to be initialized to 0 anyway,

No, it was uninitialised. It may have happened to take the value 0 sometimes.

I see the taskWAITING_NOTIFICATION macro defined

You are right, I had it the wrong way around. It was changed from an enum to a macro, whereas in my head it had been changed from a macro to an enum. The macro was introduced to force it to always be an 8-bit type in a portable way.
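For reference, the notification states in tasks.c are plain 8-bit macros along these lines:

#define taskNOT_WAITING_NOTIFICATION    ( ( uint8_t ) 0 )
#define taskWAITING_NOTIFICATION        ( ( uint8_t ) 1 )
#define taskNOTIFICATION_RECEIVED       ( ( uint8_t ) 2 )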

So we are no closer to solving this one. Could you please cut down your project to the absolute minimum amount of code that demonstrates the issue, make sure the project does not contain any absolute paths so I am able to build it on my computer without having to first make any modifications, zip it up, and send it over to me at r [dot] barry (at) freertos [dot] org.

glenenglish wrote on Tuesday, August 16, 2016:

Hi Pablo
I am very new to FreeRTOS, but I think I did observe the behaviour you describe. Here is how I worked around it with a binary semaphore.

My IRQ handler generated a TaskNotifyGiveFromISR... ok.

The task blocked waiting on this would stop receiving the notifications under 'heavy comms load'. I walked through the code and determined that my IRQ handler was generating another TaskNotifyFromISR before my waiting task could call the wait function,
like this:

while (1)
{
    /* Block for the next notification, with a timeout */
    if (ulTaskNotifyTake(pdTRUE, xTimeout) != 0)
    {
        process_something();
    }
    else
    {
        /* timeout */
    }
}
NOW, what was happening was that process_something() would re-enable the source of the interrupts, and a NEW interrupt, and thus TaskNotifyFromISR, would be generated BEFORE the task loop could call TaskNotifyWait again...
I.e. if the TaskNotifyFromISR comes in while the task is not already blocked, nothing happens.

I solved this using a binary semaphore, as it is value driven, i.e. xSemaphoreTake().
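Roughly, the binary-semaphore version looks like this (xTxSem is a placeholder handle created once with xSemaphoreCreateBinary(), and xTimeout is whatever block time suits):

/* In the IRQ handler */
BaseType_t woken = pdFALSE;
xSemaphoreGiveFromISR(xTxSem, &woken);
portYIELD_FROM_ISR(woken);

/* In the task loop */
if (xSemaphoreTake(xTxSem, xTimeout) == pdTRUE)
{
    process_something();
}
else
{
    /* timeout */
}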

Maybe this will help you. I only have 1 day on FreeRTOS, but some experience elsewhere...

I am NOT sure if my observed behaviour is correct, but that is what the debugger showed: the task was not waiting... Regards, Glen.

I know it's been a while since this was posted, but I just ran into the same issue.

I've only observed this a few times, but yesterday I finally caught it with the debugger. The symptom was that one of my tasks would stop doing its thing (because it was waiting on ulTaskNotifyTakeIndexed()). I didn't realize what the problem was until it happened on this unit with a debugger connected.

When I stepped through the code, the notifications were being sent with xTaskNotifyGiveIndexed(), but never processed. Stepping through xTaskNotifyGiveIndexed(), I saw that the ucOriginalNotifyState = pxTCB->ucNotifyState[ uxIndexToNotify ]; was set to 2, which was taskNOTIFICATION_RECEIVED. The pxTCB->ulNotifiedValue[ uxIndexToNotify ] would be incremented, but then the task was never put on the ready list because it expects ucOriginalNotifyState == taskWAITING_NOTIFICATION.

I don't know how it got into this state and I haven't been able to reproduce it again, but I do know it has happened before. For reference, the actual wait call is:
ulTaskNotifyTakeIndexed(XXX_TASK_NOTIFY_INDEX, pdTRUE, portMAX_DELAY);

I am running on a Cortex-M4 with FreeRTOS Kernel V10.4.4.

In my case, a hacky temporary solution might be to not wait forever with portMAX_DELAY, but instead time out and report an error if the timeout occurs.
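A hypothetical shape of that workaround (the 1-second timeout and the report function are placeholders):

uint32_t n = ulTaskNotifyTakeIndexed(XXX_TASK_NOTIFY_INDEX, pdTRUE,
                                     pdMS_TO_TICKS(1000));
if (n == 0)
{
    /* No notification within 1 s: report it and poll for pending work */
    reportLostNotification();
}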

I don't suspect I'll be able to get this into an easily reproducible state, so I just wanted to post for visibility that this might still be an issue.

Hi Alvaro,

Seems the most likely explanation is something corrupts the notification state while the task is blocked in ulTaskNotifyTakeIndexed(). It would lead to your exact symptoms. Do you have stack-overflow checking turned on? Do you have a malloc-failed hook installed? If using newlib, have you looked at configUSE_NEWLIB_REENTRANT? If you have advanced data breakpoints, you could set one for a write of any value other than 0, 1, and 2 to that byte in the TCB.
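For example, those checks are typically enabled along these lines (a minimal sketch; the hook bodies are just placeholders, and the hooks live in an application file that includes FreeRTOS.h and task.h):

/* FreeRTOSConfig.h */
#define configCHECK_FOR_STACK_OVERFLOW    2
#define configUSE_MALLOC_FAILED_HOOK      1

/* Hooks the application then has to provide */
void vApplicationStackOverflowHook( TaskHandle_t xTask, char *pcTaskName )
{
    ( void ) xTask;
    ( void ) pcTaskName;
    taskDISABLE_INTERRUPTS();
    for( ;; );
}

void vApplicationMallocFailedHook( void )
{
    taskDISABLE_INTERRUPTS();
    for( ;; );
}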

Jeff

Hi Jeff,

For some reason I didn't get a notification about your reply :man_facepalming:
I do have configCHECK_FOR_STACK_OVERFLOW 2 and had malloc failed hooks at the time. I have a full coredump of one of the cases when this was happening and none of my task stacks are even close to being full.

Hmm, I don't think configUSE_NEWLIB_REENTRANT is enabled. Something to take a look at for sure! I'm using mpaland's printf at the moment, which is different, but shouldn't be an issue.

I'll do more investigating if I see this again, but sometimes it takes a very long time to reproduce, so I won't get my hopes up, hah.

Alvaro

Dragging this back up from the dead in case someone else has this problem again or can explain whether/why the fix I picked up is a fix.

I ran into this problem on 10.4.3. A task whose notification comes from another task is "suspended" indefinitely, but the ulNotifiedValue for the notification index is 1 (that's what I want it to be) and the ucNotifyState is 2 (taskNOTIFICATION_RECEIVED). The task blocks for notifications using portMAX_DELAY, and the only reason it blocks is to wait for a notification.

In my case, applying the change from PR number 833 on GitHub (sorry, new users can't post links!) perturbed the issue out of existence, but I can't explain why. In that PR, the call to prvAddCurrentTaskToDelayedList in xTaskGenericNotifyWait is moved outside of the critical section it previously sat in.

Hi @sgreen

Welcome to FreeRTOS Community Forums !

The PR you mentioned suspends the scheduler before exiting the critical section (i.e. before enabling interrupts). If we did not do so, a notification sent from an ISR after exiting the critical section and before suspending the scheduler would get lost.

What might be happening in your case (without the fix from the PR) is that the task is not moved to the pendingReadyList, and hence, when the scheduler resumes, it cannot move the task into the ready list.

If the issue still persists, I would suggest opening a new issue, since this thread is very old.

Thanks.

Thanks for the reply. I should have explained in a bit more detail... we're using 10.4.3 and need to patch this issue until our CPU vendor provides a newer release. I've pulled in the changes from PR 833 and it seems to have fixed the issue.

@sgreen A small correction here, particularly for future readers. PR 833 isn't a fix for the "task stuck" issue. PR 833 changed the code to stop performing a non-deterministic operation from within a critical section. Prior to the PR, the entire operation was performed inside a critical section, so the code was functionally correct. However, it violated a FreeRTOS tenet about nondeterministic operations inside critical sections. After the PR, the operation is performed with the scheduler disabled and not in a critical section. The PR had to be careful about connecting a critical section to the scheduler disabling, so there are lots of explanatory comments in the code. The sequence of events identified in the comments as incorrect operation is impossible both before and after the PR.

Bottom line, if that PR fixes your issue, it's not clear how. Maybe an interrupt isn't processed quickly enough when the pre-PR code uses a critical section for longer than it should? How many tasks are in your application?

Thanks Jeff,

There are 13 tasks in the system (maybe that's the problem :smile:). There is an interrupt every ~300 us; it wakes up a task using a FreeRTOS queue, and that task in turn wakes up other tasks using notifications. I'll review this interrupt, the task it wakes up, and the other interrupts after the weekend to see what can happen if any are delayed.