Highest-priority task stays in the running state, blocking the other tasks

Hi, we are developing an application on a PIC32MZ with about 14 tasks running under the RTOS. One of those tasks (priority 5) is crashing the code: it executes without issues for a random number of minutes, then stays in the running state forever and blocks all the other tasks. Below is a capture from Tracealyzer. We are working with MPLAB.

[image: fail]

This is the red task:

    void TCAN4x5x_vTaskWatchDog() {

        TickType_t xTicksToDelay = pdMS_TO_TICKS(300); // 300 ms
        TickType_t xLastWakeTime = xTaskGetTickCount();

        for (;;) {
            vTaskDelayUntil(&xLastWakeTime, xTicksToDelay);

            if (TEST_SWITCH_1_Get()) { //Switch acts as a board reset
                TCAN4x5x_WDT_Reset();
            }
            
            TCAN4x5x_SPI_Status status;
            TCAN_Device_ReadSPIStatus(&status);
            
            if(status.Internal_error_interrupt){
                CAN2sig_ICU_Error_Code.setValue(CAN2sig_ICU_Error_CodeVT_TCAN_Internal_Error);
            }else if(status.SPI_error_interrupt){
                CAN2sig_ICU_Error_Code.setValue(CAN2sig_ICU_Error_CodeVT_TCAN_SPI_Error);
            }

        }
    }

    void
    TCAN_Device_ReadSPIStatus(TCAN4x5x_SPI_Status *reg){
        reg->word = AHB_READ_32(REG_SPI_STATUS);
    }

Do you have any idea of what could cause this issue?

Also, the red task is not created inside the SYS_Tasks() function; it is created inside another task (one with lower priority). Is this correct, or must all tasks be created inside SYS_Tasks()?

thanks

The only scenario I can think of here that involves FreeRTOS is that your xLastWakeTime gets overwritten somewhere in the function body, so that the wait always returns immediately. To test that theory, replace the vTaskDelayUntil() call with a plain vTaskDelay() and see whether the behavior changes.
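To illustrate why a bad xLastWakeTime would make vTaskDelayUntil() return without blocking, here is a simplified host-side model of the wake-time comparison. It is only a sketch and not the kernel source: tick_t, model_delay_until and count_nonblocking_calls are names made up for this illustration, and a 32-bit tick type is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tick_t; /* stand-in for TickType_t; 32 bits is an assumption */

/* Simplified model of the check made by vTaskDelayUntil().
 * Returns true if the call would block. *last_wake is always advanced
 * by 'increment', whether or not the call blocks. */
static bool model_delay_until(tick_t *last_wake, tick_t increment, tick_t now)
{
    assert(increment > 0);
    tick_t wake = *last_wake + increment; /* may wrap around, like the tick count */
    bool should_block;

    if (now < *last_wake) {
        /* The tick count appears to have wrapped since the last wake-up. */
        should_block = (wake < *last_wake) && (wake > now);
    } else {
        should_block = (wake < *last_wake) || (wake > now);
    }
    *last_wake = wake;
    return should_block;
}

/* Counts consecutive calls that return without blocking while the tick
 * count is held at 'now' (hypothetical helper for experimenting). */
static unsigned count_nonblocking_calls(tick_t last_wake, tick_t increment, tick_t now)
{
    unsigned n = 0;
    while (!model_delay_until(&last_wake, increment, now)) {
        n++;
    }
    return n;
}
```

In this model, a sane xLastWakeTime (for example 1000 at tick 1100 with a 300-tick period) blocks on the first call, while a value that has been overwritten with something far in the past or far in the future lets many iterations of the task loop run back to back before the task blocks again. On the target, the experiment itself is just a temporary swap of vTaskDelayUntil(&xLastWakeTime, xTicksToDelay) for vTaskDelay(xTicksToDelay).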

The usual debugging technique here is to add several counters in the infinite loop and monitor those counters via real time watch if your IDE allows for that.

Another thing to check is to break in the debugger and see what the task is doing when it appears to be running forever.

Thanks.

Is it possible that the TCAN_Device_ReadSPIStatus(&status) line gets stuck in a loop somewhere?

As per Gaurav’s suggestion: if you know the task is running, break in the debugger and see what it is executing. Pilot’s suggestion that it might be in TCAN_Device_ReadSPIStatus() seems likely, unless you just have data corruption somewhere.

We have a task that runs every 1 ms. Is it advisable to have a task that executes at the same frequency as the RTOS tick?

Is that the task that’s trashing? Then it might be an issue with the delay-until call. Again, test with a plain vTaskDelay().

Hi, but in that case the task would fail immediately, and that is not what happens. If you look at the image, the red task runs normally and then suddenly stays in the running state.

No, the task with issues is the one in the code I shared; it runs every 300 ms with priority 5. The task running every 1 ms has priority 2 (blue in the trace, QUECTEL_GNSS).

The task running at the same frequency as the RTOS tick shouldn’t be a problem, though you will likely find that it doesn’t ACTUALLY run every millisecond. That’s a whole separate issue though, and has nothing to do with this query.

As for this issue, you really need to break in the debugger when it hangs in that task and find out what it’s doing. Until you know that, there’s not going to be much more anyone can tell you. In all likelihood, the RTOS isn’t forcing that task to stay active; the task itself is forcing the RTOS to let it remain active because it’s busy doing something.

No, it can theoretically fail at any time.

Again, you should first determine where in the infinite loop you are “stuck.” Either one of the functions called within the for loop is stuck in a “local” infinite loop, or the wait function always returns immediately without waiting, which would indicate some kind of problem with the xLastWakeTime variable once your scenario is reached. To me it looks like all of the code in your loop body except the wait may be non-yielding, so if the wait does not block, you would run through the loop continuously as a CPU-bound computation.

It is fairly straightforward debugging to determine which of the two is the case. Several techniques have been proposed here. Once you have figured out which of the two scenarios holds, the rest is routine.

After halting the target with the debugger, you can check the backtrace/call stack to see which functions led to the point where the program stopped.

Sorry, I had other projects to attend to; I am now back on this post.

@RAc

Hi, what infinite loop are you talking about? Also, I don’t understand how those counters could be useful. Thanks.

@aggarg
Hi, I pause the debugger when the program crashes, but the MPLAB debugger gets stuck and does not advance to the next instruction when I use the step-over button.

@pilot

@RAc

No, we are sure that line is not the problem, because even if we don’t use the task that is actually crashing the program (TCAN4x5x_vTaskCW), another task (a random one) enters this running state and crashes the program.

Using Tracealyzer I found the following. This is the CPU load under normal conditions (the program does not crash):

[image: tasks_before_crash]

Here I could see that the Quectel Rx task causes excessive CPU load when the program starts, and then the GNSS task causes excessive CPU load during the entire execution of the program. I found that the cause of this overload is a while loop recommended by Microchip:

    UART4_Read(&rx_buffer[0],3);
    //while(UART4_ReadIsBusy()); causes cpu overload

After disabling this while loop I no longer have CPU overload when the program is in normal operation, but it still fails. We managed to capture a trace just as the program crashed, and we got this:
[image: tasks_after_crash]

So we could see that the CPU load is better, but at the end we still get an overload. We realized that when we don’t use the UART4_Read function the program never crashes. It seems there is a problem with the Microchip library, because we tested with different code that gets messages from the UART in a different way (but still using the UART_Read function) and it crashes too. Has anybody had a problem like this with the UART libraries and FreeRTOS?
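One point worth noting about the commented-out wait: assuming UART4_Read() is the non-blocking (interrupt-mode) variant that returns before the bytes have arrived, dropping the wait entirely means rx_buffer may be processed before it has been filled. A middle ground between spinning forever and not waiting at all is a bounded wait that gives up the CPU between polls. The following is a minimal sketch under those assumptions; wait_while_busy, busy_fn_t, yield_fn_t and the stub_* helpers are hypothetical names, and the function pointers are there only so the helper can be tried on a host PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef bool (*busy_fn_t)(void);  /* on the target: e.g. UART4_ReadIsBusy */
typedef void (*yield_fn_t)(void); /* on the target: e.g. a wrapper calling vTaskDelay(1) */

/* Polls is_busy() up to max_polls times, calling yield_cpu() after each
 * busy result so lower-priority tasks can run. Returns true once the
 * peripheral is idle, false if it is still busy after the last poll. */
static bool wait_while_busy(busy_fn_t is_busy, yield_fn_t yield_cpu, uint32_t max_polls)
{
    assert(is_busy && yield_cpu);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (!is_busy()) {
            return true;
        }
        yield_cpu();
    }
    return !is_busy(); /* one final check after the last yield */
}

/* Host-side stubs (hypothetical) that fake a peripheral which stays busy
 * for a given number of polls. */
static uint32_t stub_busy_polls_left;
static uint32_t stub_yield_count;

static bool stub_is_busy(void)
{
    if (stub_busy_polls_left == 0) {
        return false;
    }
    stub_busy_polls_left--;
    return true;
}

static void stub_yield(void)
{
    stub_yield_count++;
}
```

On the target, the call would look something like wait_while_busy(UART4_ReadIsBusy, delay_one_tick, 50), where delay_one_tick is a small wrapper around vTaskDelay(1) and 50 is an arbitrary limit. A false return then becomes an error to handle (and a handy place for a breakpoint) instead of a task that never leaves the running state.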

Hi, how can I check the backtrace/call stack? I am using MPLAB, and the debugger gets stuck in the Execution Memory tab. It seems the debugger crashes along with the program, so I have to close MPLAB and re-launch the debugger to start another test.
[image: debug gnss]

I’m talking about the task function which (as in most cases) is implemented as an infinite loop (for( ; ; ){}).

Here’s some pseudo code:

    for (; ; )
    {
        increment counter 1;
        [some stuff]
        increment counter 2;
        [some other stuff]
        increment counter 3;
    }

Assuming that your IDE allows you to watch variables in real time, do so with counter 1 up to counter x. If you see, for example, that counter 3 does not get incremented for several seconds after counter 2 increments, you know your problem code is in [some other stuff].

If the delays are too small to reliably see the deltas, simply compute the deltas between the increments (ideally not using the OS timer but some hardware timer or non conforming interrupt timer) and store the maximum on the fly.

Edit: This technique is also useful for nailing down deadlocks.
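As a concrete version of the counters and the "store the maximum delta" idea, here is a minimal C sketch. All names in it (loop_trace_t, checkpoint_hit, NUM_CHECKPOINTS) are made up for illustration, and the timestamp is passed in by the caller so that any free-running hardware counter can be used as the time source.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CHECKPOINTS 3u

typedef struct {
    volatile uint32_t hits[NUM_CHECKPOINTS];      /* watch these in the debugger */
    volatile uint32_t max_delta[NUM_CHECKPOINTS]; /* longest time taken to reach checkpoint i
                                                     from the previously hit checkpoint */
    uint32_t last_stamp;                          /* timestamp of the previous hit */
    uint32_t started;                             /* 0 until the first hit is recorded */
} loop_trace_t;

/* Records that checkpoint 'id' was reached at time 'now' (in timer ticks). */
static void checkpoint_hit(loop_trace_t *t, uint32_t id, uint32_t now)
{
    assert(t);
    if (id >= NUM_CHECKPOINTS) {
        return; /* ignore out-of-range ids instead of corrupting memory */
    }
    if (t->started) {
        uint32_t delta = now - t->last_stamp; /* unsigned math tolerates one timer wrap */
        if (delta > t->max_delta[id]) {
            t->max_delta[id] = delta;
        }
    }
    t->hits[id]++;
    t->last_stamp = now;
    t->started = 1u;
}
```

In the task loop this would be used as checkpoint_hit(&trace, 0, timer_now()); at the top, checkpoint_hit(&trace, 1, timer_now()); after the first block, and so on, where trace is a zero-initialized static loop_trace_t and timer_now() is whatever reads your free-running counter. If hits[1] keeps pace with hits[0] but hits[2] stops, or max_delta[2] is huge, the time is being spent between checkpoints 1 and 2.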

Hmm… the debugger shouldn’t crash when just halting a target.
So you can’t click/activate the Call Stack tab after stopping the target ?
Regarding UART4_Read: is rx_buffer large enough to meet the requirements of UART4_Read?

As soon as the program crashes I pause the debug session, and the cursor goes to the Execution Memory tab. I can press the step-over button maybe 5 or 6 times, then the debugger keeps “running” for a long time trying to reach the next line, but it never gets there and I end up closing MPLAB.

I could try. Which option should I use? Sorry, for debugging I only use the step-over and step-into buttons; I have never debugged at this level.

Yes, I also increased the buffer size and it still fails.

In your previous screenshot when in Execution Memory window the rightmost tab is entitled Call Stack. I’m sure there is a corresponding menu entry to activate it, too.

With MPLAB, if breaking the debug session leaves you in outer space with no step-over/step-into functionality, you can pretty much guarantee it’s stuck in one of Microchip’s libraries.

Personally, I don’t use a single Harmony/PLIB/etc library from Microchip, because they all suck. I write my own hardware drivers.

You should still be able to activate the call stack though (Window → Debugging → Call Stack), which will confirm that some Microchip library routine was called.