UNDEFINED BEHAVIOR using vPortFree() on in-use kernel object

Sorry for using “undefined behavior” in the title to bring you guys here.

I have not yet found an answer to the following question:

What happens if we call vPortFree() on a mutex or queue in the case below?

  • Mutex M is taken by task A, and task B is blocked because it wants to take mutex M.
  • At that moment, task C calls vPortFree(M), while tasks A and B still run periodically.
  • What will happen next?

More specifically:

  • I am using FreeRTOS V8.2.1 on an nRF51822 (which has only one UART) for a tracking device.
  • Task A runs every 90 minutes. It uses the UART.
  • Task B runs every 4 minutes. It also uses the UART, but not always.
  • I use mutex M to protect the UART.
  • There is an emergency button dedicated to forcing tasks A and B to run immediately.
  • When the user presses the button, I call vPortFree() on M, A, and B and re-create all of them. (Of course there are reasons I have to do this. I don’t want this topic to focus on this way of handling the emergency button, but if anyone has ideas, I would be happy to listen.)
  • In the endurance test, the device runs 24/7 for 1 month, with a random emergency trigger 24 times a day.
  • I found that 3% of all emergency triggers cause a LOCKUP reset, which I identified using information from this document: https://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.pdf
  • Searching for the cause of the LOCKUP reset, I found very little information. Every clue points to an error occurring inside the HardFault handler.
  • I think calling vPortFree(M) is related to this LOCKUP reset.
  • I tried to read deeper into the kernel code, but I’m not expert enough to figure it out, so I am posting here to see if someone can help. Thanks.

Calling vPortFree() directly on a mutex will corrupt your program. Even calling vSemaphoreDelete() (the proper way to remove a mutex) on a mutex that is currently in use will cause problems, as you are invalidating an object that is in use.

You say you don’t want to focus on it, but the whole concept of deleting and recreating objects/tasks is the core issue here. Deleting a running task will lose any resource that the task currently holds. Deleting an object that is in use will corrupt your program state. DON’T DO IT THAT WAY.

On ‘big’ systems, you can have a supervisor task delete a process and recreate it, because the system isolates the processes and keeps track of what resources each process has, so deleting a process at a random point can work. FreeRTOS is not such a system so you shouldn’t treat it like it was.

The way to handle this sort of emergency abort is to set a flag that each task needs to monitor and change its behavior accordingly, and perhaps send the task an abort-delay operation (and be sure you check for timeouts, even on infinite delays, since they are no longer infinite). This allows each task to terminate its current operation in an ORDERLY way and move into the emergency operation.

We are always here…(needed 20 characters to be able to post)

Many thanks for the confirmation. So we should not delete any in-use kernel object, since doing so results in corruption (I experienced CPU LOCKUP resets and watchdog resets).

But I did not really understand the method you propose here:

The way to handle this sort of emergency abort is to set a flag that each task needs to monitor and change its behavior accordingly, and perhaps send the task an abort-delay operation (and be sure you check for timeouts, even on infinite delays, since they are no longer infinite). This allows each task to terminate its current operation in an ORDERLY way and move into the emergency operation.

Can you explain in more detail? I would love to hear it!

I’m not speaking for @richard-damon but here’s my take on it.

Depending on how many tasks are in the system, you can either use a common event group or enumerate all the tasks in the system and send each of them a notification, as per the tip given on the event groups page.

You simply have to make sure not to use infinite timeout (portMAX_DELAY) values, as these would prevent anything else from being processed in the task until whatever event the API call is waiting for has occurred.

So a simplified code example would be:

#include <limits.h>   /* ULONG_MAX */
#include "FreeRTOS.h"
#include "task.h"

void aTask( void *pvParams )
{
    uint32_t ulNotifiedValue;
    BaseType_t xResult = pdFAIL;

    ( void ) pvParams; /* Unused in this example. */

    for( ;; )
    {
        xResult = xTaskNotifyWait( 0,                     /* Don't clear bits on entry. */
                                   ULONG_MAX,             /* Clear all bits on exit. */
                                   &ulNotifiedValue,      /* Stores the notified value. */
                                   pdMS_TO_TICKS( 10 ) ); /* Wait at most 10 ms for the notification. */

        if( xResult == pdPASS )
        {
            /* The task received a notification to shut down, so clean up any
               allocated resources and delete ourselves, or reset the task
               state machine. */
        }
        else
        {
            /* No task notification, so process the normal activity of the task. */
            /* Make sure no API call uses portMAX_DELAY. */
        }
    }
}

Hope this helps clear things up a small bit.

The method I described is to use a global flag (or flags) that the ‘emergency’ task can set, and that tasks that need to respond to the emergency can check when they are doing some action that might take a while, letting them voluntarily ‘abort’ their activity and move to their emergency response. A task that might be stuck in a wait of some sort (waiting on a queue or semaphore for some action) can be forced to wake up by using xTaskAbortDelay(). Alternatively, if you know what they might be waiting on, you can send the needed notification through that channel.

The real basis is that fundamentally you design the tasks to handle the emergency, as opposed to trying to externally force the task to do something outside its programming. If you really get to a state where you can’t bring a task back to the needed state, normally the best solution is a total system restart, not a partial one, so you can reinitialize everything.
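
As a rough, host-side sketch of the flag-plus-wake-up idea described above: the helper below wraps a blocking wait and, when the wait returns, checks the flag before trusting the result. Everything named here (g_emergency, wait_for_resource, the take/give hooks, and the demo_* stubs) is hypothetical and not part of FreeRTOS. In firmware, the take hook would wrap something like xSemaphoreTake(), and the emergency task would set the flag and then call xTaskAbortDelay() on the blocked task. As far as I know, xTaskAbortDelay() requires INCLUDE_xTaskAbortDelay set to 1 and FreeRTOS V9.0.0 or later, so treat its availability on older kernels as something to check.

```c
#include <stdbool.h>

/* Set by the emergency task, polled by worker tasks. */
static volatile bool g_emergency = false;

typedef enum { WAIT_OK, WAIT_TIMEOUT, WAIT_EMERGENCY } wait_result_t;

/* Hypothetical hooks. In firmware, take() might wrap
   xSemaphoreTake( xMutex, xTicksToWait ) == pdTRUE and give() might wrap
   xSemaphoreGive( xMutex ). */
typedef bool (*take_fn_t)(void);
typedef void (*give_fn_t)(void);

/* Block on a resource, then decide what the wake-up means. A false return
   from take() is either a timeout or an aborted delay; the flag tells the
   two apart. */
wait_result_t wait_for_resource(take_fn_t take, give_fn_t give)
{
    bool got_it = take();

    if (g_emergency) {
        if (got_it) {
            give();   /* do not hold the resource across the mode switch */
        }
        return WAIT_EMERGENCY;
    }
    return got_it ? WAIT_OK : WAIT_TIMEOUT;
}

/* Demo stubs so the sketch runs without an RTOS. */
static int held = 0;
bool demo_take_ok(void)   { held++; return true; }
bool demo_take_fail(void) { return false; }
void demo_give(void)      { held--; }
```

The caller treats WAIT_EMERGENCY as "stop the normal job and run the emergency response", and never assumes that a wait with portMAX_DELAY only returns on success.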

@SergentSloGin

Thanks for your idea. I will see if I can apply that to my project!

@richard-damon

The second idea, which is to restart the whole system, has a small chance of corrupting memory if the reset lands during a write to EEPROM or flash. I can handle that by looping, trying to take the memory’s semaphore with zero timeout. Once any write to memory has finished, I can do a system restart. However, my device contains some modules, such as the Iridium 9603, that run their own firmware. Any system reset could potentially damage a module that is in use.
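
A minimal sketch of the zero-timeout polling mentioned above, written as plain C so it can be tried on a host. The try_lock and yield hooks are assumptions: in firmware they might wrap xSemaphoreTake( xMemoryMutex, 0 ) and vTaskDelay( 1 ), where xMemoryMutex is a hypothetical handle. The attempt limit is also an assumption, added so the loop cannot spin forever.

```c
#include <stdbool.h>

/* Hypothetical hooks standing in for RTOS calls. */
typedef bool (*try_lock_fn_t)(void);  /* non-blocking take of the memory lock */
typedef void (*yield_fn_t)(void);     /* let the current writer make progress */

/* Poll the memory lock without blocking. Returns true once the lock is
   held (no write can be in progress), false if it stayed busy for
   max_attempts polls. */
bool acquire_before_reset(try_lock_fn_t try_lock, yield_fn_t yield,
                          unsigned max_attempts)
{
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        if (try_lock()) {
            return true;
        }
        yield();
    }
    return false;
}

/* Demo stubs: the lock is busy for the first demo_busy_polls polls. */
static unsigned demo_busy_polls = 2;
static unsigned demo_yields = 0;

bool demo_try_lock(void)
{
    if (demo_busy_polls > 0) {
        demo_busy_polls--;
        return false;
    }
    return true;
}

void demo_yield(void) { demo_yields++; }
```

Once the function returns true, the caller still holds the lock, so no new write can start before the reset is triggered. This only protects the EEPROM/flash write; it does nothing for external modules such as the Iridium 9603.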

Because of that, I prefer your first method:

The method I described is to use a global flag (or flags) that the ‘emergency’ task can set, and that tasks that need to respond to the emergency can check when they are doing some action that might take a while, letting them voluntarily ‘abort’ their activity and move to their emergency response. A task that might be stuck in a wait of some sort (waiting on a queue or semaphore for some action) can be forced to wake up by using xTaskAbortDelay(). Alternatively, if you know what they might be waiting on, you can send the needed notification through that channel.

I think the trade-off of this method is the overhead of the emergency flag checks that we insert into the tasks. I plan to break each task into several sections that need to be atomic, and insert emergency flag checks between these sections and inside any loops that take too much time. If any checkpoint detects the emergency, the task stops what it is doing and releases all its resources. Then it just starts the emergency response.
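
The checkpoint plan above might look roughly like the sketch below. It is host-runnable C built from stand-ins only: emergency_flag, uart_lock()/uart_unlock() (standing in for taking and giving mutex M), run_section(), and send_report() are all hypothetical names, not FreeRTOS APIs. The point it illustrates is the single exit path: whether the job finishes or is aborted at a checkpoint, the UART lock is released before the function returns.

```c
#include <stdbool.h>

/* Set by the emergency handler, checked at every checkpoint. */
static volatile bool emergency_flag = false;

/* Stand-in for mutex M. In firmware this would be xSemaphoreTake() with a
   finite timeout and xSemaphoreGive(). */
static bool uart_locked = false;

static bool uart_lock(void)
{
    if (uart_locked) {
        return false;
    }
    uart_locked = true;
    return true;
}

static void uart_unlock(void) { uart_locked = false; }

/* Stand-in for one section of work that must not be interrupted. */
static int sections_run = 0;
static void run_section(void) { sections_run++; }

typedef enum { REPORT_SENT, REPORT_ABORTED, REPORT_BUSY } report_status_t;

report_status_t send_report(int n_sections)
{
    report_status_t status = REPORT_SENT;

    if (!uart_lock()) {
        return REPORT_BUSY;            /* nothing acquired, nothing to release */
    }

    for (int i = 0; i < n_sections; i++) {
        if (emergency_flag) {          /* checkpoint between atomic sections */
            status = REPORT_ABORTED;
            break;
        }
        run_section();
    }

    uart_unlock();                     /* released on both normal and abort paths */
    return status;
}
```

After REPORT_ABORTED the task holds nothing and can start the emergency response immediately.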

Please verify for me that I understood your idea. I will implement this idea, test it, and also do a massive OTA update. This will take time, but I will tell you the result and all feedback @richard-damon

Yes, you seem to understand my suggestion. I will add that there is little overhead in doing this. It adds some clutter to the code, but checking a global variable is quick.

Agreed. Since the tracking application has only soft real-time requirements, this overhead is no big deal. Thanks again.

Even with hard real-time requirements, adding the test only adds a few cycles to the code, and it would normally sit just after or just before a call to a FreeRTOS function, which would be much slower.