We are running Atmel SAM L21J18B
gcc version 5.4.1 20160919 (15:5.4.1+svn241155-1) (arm-none-eabi-g++)
FreeRTOSv9.0.0
It is our own code configured as tickless and running on our hardware.
Occasionally we see a HardFault exception (about once every few hours). I was able to capture the stacks and the trace from the micro trace buffer. It is always the same failure in the same place: the moment when the context switch loads the new stack. I believe that the corrupted value is pxCurrentTCB (line 1688 in the attached stack_debug_freertos.txt file).
The crash is pretty consistent and happens several times a day on a few different devices. It always happens when the context switch is initiated by exiting from an interrupt (we see it just after the USB interrupt, but I think that is a coincidence and it may be any interrupt).
Looking at the trace (see the attached file), I concluded that vListInsertEnd() or prvAddCurrentTaskToDelayedList() had not completed their operations when the USB interrupt was followed by the PendSV interrupt. So the RTOS was trying to switch the context while the stack was still invalid. Both vListInsertEnd() and prvAddCurrentTaskToDelayedList() are called from the vTaskPlaceOnEventListRestricted(), which is called by the vTaskPlaceOnEventListRestricted() (see Source/tasks.c). There is no critical section in any of these functions.
I added taskENTER_CRITICAL()/taskEXIT_CRITICAL() calls at the beginning/end of the vTaskPlaceOnEventListRestricted() function.
The failure behavior changed: it started to fail in vTaskPlaceOnUnorderedEventList(). So I added critical sections to vTaskPlaceOnUnorderedEventList() and vTaskPlaceOnEventList() as well. It appears that the other functions in tasks.c that call vListInsertEnd() and prvAddCurrentTaskToDelayedList() do have critical sections.
After these three fixes all my systems have been running for the last 24 hours without a HardFault exception.
I am not sure whether this is a correct fix. Should I open a ticket (I am not sure how to do that)? I checked the v10.1.1 sources: this part of the code wasn’t changed.
Forgot to mention: since the code is configured as tickless, if a task is pending on a queue forever, it will be suspended, not delayed (as far as I understand the tasks.c code). And our code has such tasks.
Does this still happen when tickless mode is not used?
Are you using the default tickless implementation, or one tailored for
your hardware?
The usual suspect questions to ask also:
Can you update to the latest FreeRTOS kernel code and ensure
configASSERT() is defined - the newer the code the more assert points
there are to catch interrupt priority misconfiguration - the latest
version catches nearly all (if not actually all).
Does the USB driver, or any other code, touch the basepri or any other
interrupt enable/masking registers?
Do you have the priority of all interrupts set at or below
configMAX_SYSCALL_INTERRUPT_PRIORITY if they are using FreeRTOS API
functions? Sometimes people think they are, but when they switch to the
latest FreeRTOS kernel with the extra asserts they realise they are not.
Do you have stack overflow protection turned on? If so, be aware that
it only checks the task stacks, not the interrupt stack. The interrupt
stack is the stack used by main() (which is then re-used as the
interrupt stack to recover the RAM).
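For reference, the two checks mentioned above are normally switched on in FreeRTOSConfig.h. The following is only a minimal sketch, assuming the project does not define them already; halting in place on a failed assert is one common choice, not the only one.

```c
/* FreeRTOSConfig.h fragment (sketch only; adapt to your project). */

/* Halt with interrupts disabled when a kernel assertion fails, so the
 * failing call can be inspected in a debugger. */
#define configASSERT( x )    if( ( x ) == 0 ) { taskDISABLE_INTERRUPTS(); for( ;; ); }

/* Method 2 additionally checks a fill pattern at the end of each task
 * stack on every context switch. This covers task stacks only, not the
 * stack used by interrupts. */
#define configCHECK_FOR_STACK_OVERFLOW    2
```

With the overflow check enabled, the application also has to provide vApplicationStackOverflowHook().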
It is not easy to catch the problem (it takes a few hours/days) and we need tickless, so I just didn’t have time to try without tickless. Also, we adapted the tickless implementation to use the L21 RTC.
As I said earlier, the trace is pointing to functions that were not changed in v10.1.1. But it is in the plan to move to the latest FreeRTOS soon.
The USB driver is from Atmel SDK. It calls its own critical sections:
# define cpu_irq_enable() \
do { \
g_interrupt_enabled = true; \
__DMB(); \
__enable_irq(); \
} while (0)
# define cpu_irq_disable() \
do { \
__disable_irq(); \
__DMB(); \
g_interrupt_enabled = false; \
} while (0)
Looks fine to me…
It is a Cortex-M0+…
I was playing with priorities before capturing the trace. In the end I set all priorities (including the RTOS context switch) to the same level. It never changed the outcome.
Yes, I turned on the stack overflow protection in order to catch the problem, and it didn’t show anything. I also moved the stacks around in memory, and it is still the same: just one pointer in the stack is corrupted.
Can you write to me in private? Perhaps we can arrange a meeting and I can walk you through the trace?
It is not easy to catch the problem (it takes a few hours/days) and we need
tickless, so I just didn’t have time to try without tickless. Also, we
adapted the tickless implementation to use the L21 RTC.
The intention is to see whether the issue is related to tickless or not.
The USB driver is from Atmel SDK. It calls its own critical sections:
# define cpu_irq_enable() \
do { \
g_interrupt_enabled = true; \
__DMB(); \
__enable_irq(); \
} while (0)
# define cpu_irq_disable() \
do { \
__disable_irq(); \
__DMB(); \
g_interrupt_enabled = false; \
} while (0)
This is only ok because it is globally disabling and re-enabling
interrupts (assuming the macros are used from inside something that
counts nesting, otherwise interrupts are blindly re-enabled by the
macro) and FreeRTOS only uses interrupt masking with the basepri
register - so there is no conflict with the driver and the RTOS’s
accesses to the hardware.
Ah, re-reading your original post, I see you are using an M0, which
doesn’t have a basepri register, so the critical sections used by the
driver could indeed be a problem. If FreeRTOS is inside a critical
section when cpu_irq_enable() is called then the critical section will
be exited and you will have trouble. So:
Is it possible that cpu_irq_enable() is called from inside a critical
section? Probably not.
Is cpu_irq_enable() called in a way that sets the interrupt mask back
to whatever it was when cpu_irq_disable() was called? Or does it just
enable interrupts regardless of whether they were originally enabled or not?
How often is printf() called (and how often is memory allocated)?
Calling printf() is one of the most common causes of issues in a
multithreaded environment with very small stacks.
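On the cpu_irq_enable() question, the difference between a blind re-enable and a save/restore pair can be shown with a small host-side simulation. This is only a sketch: every name prefixed sim_ is hypothetical, the "interrupt mask" is a plain variable standing in for the global mask on the M0+, and the counting critical section is an assumption modelled on how a nesting critical section behaves in general, not a copy of any driver or kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated global interrupt mask (stand-in for the real register).
 * true = interrupts masked. */
static bool sim_irq_masked = false;

/* Nesting depth of the simulated counting critical section. */
static unsigned sim_nesting = 0;

/* Counting critical section: only the outermost exit unmasks. */
static void sim_enter_critical(void)
{
    sim_irq_masked = true;
    sim_nesting++;
}

static void sim_exit_critical(void)
{
    assert(sim_nesting > 0);
    sim_nesting--;
    if (sim_nesting == 0) {
        sim_irq_masked = false;
    }
}

/* Driver-style pair that ignores the previous state. */
static void sim_driver_irq_disable(void) { sim_irq_masked = true; }
static void sim_driver_irq_enable(void)  { sim_irq_masked = false; }

/* Save/restore pair that puts the mask back as it found it. */
static bool sim_irq_save(void)
{
    bool was_masked = sim_irq_masked;
    sim_irq_masked = true;
    return was_masked;
}

static void sim_irq_restore(bool was_masked) { sim_irq_masked = was_masked; }

/* Runs a blind driver section inside a critical section and reports
 * whether interrupts are still masked when the driver section ends. */
bool masked_after_blind_driver_section(void)
{
    sim_irq_masked = false;
    sim_nesting = 0;
    sim_enter_critical();
    sim_driver_irq_disable();
    sim_driver_irq_enable();      /* unmasks although nesting is still 1 */
    bool still_masked = sim_irq_masked;
    sim_exit_critical();
    return still_masked;
}

/* Same scenario, but the driver section saves and restores the mask. */
bool masked_after_restoring_driver_section(void)
{
    sim_irq_masked = false;
    sim_nesting = 0;
    sim_enter_critical();
    bool saved = sim_irq_save();
    sim_irq_restore(saved);       /* mask returns to "masked" */
    bool still_masked = sim_irq_masked;
    sim_exit_critical();
    return still_masked;
}
```

In the blind case the outer critical section is silently ended by the inner enable, which is the situation to look for in the driver.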
I see your point. Let me check how cpu_irq_disable()/cpu_irq_enable() are used.
I would love to try your suggestions of disabling tickless and moving to a newer RTOS version, but with testing it may take over a week.
Thank you,
Alex.
Hi Richard,
So theoretically using two different critical sections may cause a problem, so I redefined the SDK functions to use the FreeRTOS functions. It didn’t solve the problem. It started to happen even more often (probably a coincidence). Do you want to look into the trace file with me? IMO, it is pointing to a hole in the OS.
Thank you,
Alex.
We allocate memory only during init. After that malloc() is never called again, and we never call free().
We do use printf for debugging. Messages are short. I can disable all of them but I doubt that it will help: I do not see any call to printf while analyzing the trace buffer.
Didn’t try it yet. On Friday I switched the USB driver into another mode (IP over USB rather than single characters). Still testing. I haven’t seen an exception for 37 hours.