My application using the posix port is getting deadlocked. I have found 2 different cases where it happens.
In the posix port, the tick handler is implemented via a signal handler. Signal handlers should not perform blocking calls. However, I found 2 different cases where blocking calls are being made inside a posix signal handler (the tick interrupt).
I have screen captures of the stack trace in eclipse with my commentary showing the 2 different scenarios. These 2 screen captures are attached.
I was previously using the old William Davy posix port for many months which didn’t have this issue. I switched to the “official” posix port just 2 days ago and today I got complaints from other developers about random deadlocks. I investigated and found these 2 scenarios.
For the 1st deadlock when the task switch occurs while a thread is blocking on vTaskDelayUntil() I can’t think of a simple way to fix this.
For the 2nd deadlock I could see setting a flag in the tick handler to indicate that we are in an ISR and should not block. Then, when vPortYield() is called, if this flag is set it would just return.
But for the 2nd deadlock situation when the signal handler occurs during vTaskDelayUntil() and then inside the tick handler the task suspends itself again I can’t yet think of a way around this.
This is much higher priority for me than the other issue with detecting stack overflow.
I’m probably going to have to go back to the old William Davy port to unblock my developers.
@gedeonag: Can you please have a look at this?
I remember reading that with this posix port that the SIGALRM/tick handler should only happen in the active thread. But clearly you can see it is happening in blocked threads.
This can possibly be explained because I am doing an infinite nanosleep() call in my idle thread. If I don’t do this the application is using 100% CPU on my system and that causes issues.
So when the timer is up and the signal handler gets called, if no thread is active because they are all blocked then I suppose a random thread is chosen for the signal handler? Don’t know if it would be possible to force the SIGALRM signal handler to interrupt the IDLE thread if everything is currently blocked?
A blocked thread will be awakened to service the signal, I believe. We are purposely making all threads block signals, so only one thread should receive signals.
Are you creating different threads with pthread_create?
EDIT: Usually done to simulate interrupts by tasks that are not controlled by FreeRTOS scheduler
@gedeonag: Yes I am creating a single thread outside of FreeRTOS with pthread_create() for the purpose of simulating interrupts. This thread makes no FreeRTOS calls. It will set flags that are later checked in the tick hook, and then the tick hook may unblock semaphores.
Is it possible that this separate task with pthread_create() somehow makes it possible for the FreeRTOS threads to handle signals?
ah yeah that would be a problem, but with an easy fix
just block all signals on that thread right after creating it
Similar to this: https://github.com/FreeRTOS/FreeRTOS-Plus-TCP/blob/main/portable/NetworkInterface/linux/NetworkInterface.c#L661
The extra thread you created (if its signals are not masked) will be able to handle signals destined for the single thread that is supposed to handle signals in port.c. As a consequence, FreeRTOS will hijack that thread and schedule its own stuff on it… and weird things could happen, including deadlocks (though what you are seeing could be a different issue).
@gedeonag: Thanks for that hint. I made the change and ran my tests continuously overnight with no problems. So this issue is probably gone. I have submitted this fix to our repository and I shall see in the coming days if any of my developers complain about it again.
I suggest that this constraint gets mentioned in the docs/webpage for the posix port.
Thanks @gedeonag for the terrific support!
EDIT: In my pthread I did enable SIGINT while disabling all other signals. I think SIGINT is needed for gdb? Could enabling SIGINT cause any issues?
I am glad the problem is gone for you.
SIGINT should be fine. It is not necessary though, as the “main” thread (the one set to receive signals) can still receive SIGINT and stop the process with gdb.