Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

agostrer wrote on Wednesday, December 05, 2018:

We are running Atmel SAM L21J18B
gcc version 5.4.1 20160919 (15:5.4.1+svn241155-1) (arm-none-eabi-g++)
FreeRTOSv9.0.0
It is our own code configured as tickless and running on our hardware.

Eventually we see the HardFault exception (once per few hours). I was able to capture stacks and the trace from the micro trace buffer. It is always the same failure in the same place at the moment, when the context switch loads new stack. I believe that it is pxCurrentTCB (line 1688 in attached stack_debug_freertos.txt file).
The crash is pretty consistent and happens several times a day on few different devices. It always happen, when the context switch is initiated by exiting from an interrupt (we see it just after the USB interrupt but I think it is a coincedence - may be any interrupt).

Looking to the trace (see the attached file), I decided that the vListInsertEnd() or prvAddCurrentTaskToDelayedList() didn’t complete their operations, when the USB interrupt was followed by the PendSV interrupt. So the RTOS was trying to switch the context while the stack was still invalid. Both vListInsertEnd() and prvAddCurrentTaskToDelayedList() are called from the vTaskPlaceOnEventListRestricted(), which is called by the vTaskPlaceOnEventListRestricted() (see Source/tasks.c). There is no critical section in any of this function.

I added taskENTER_CRITICAL()/taskEXIT_CRITICAL() calls at beginning/end of the vTaskPlaceOnEventListRestricted() function.
The failure bahavior changed: it started to fail in the vTaskPlaceOnUnorderedEventList(). So I added critical sections to vTaskPlaceOnUnorderedEventList() and vTaskPlaceOnEventList() also. It appears that other functions in tasks.c that call vListInsertEnd() and prvAddCurrentTaskToDelayedList() do have the critical sections.
After these three fixes all my systems are running for last 24 hours without the HardFault exception.

Not sure if it is a correct fix. Should I open a ticket (not sure how to do it)? I checked v10.1.1 sorces - this part of the code weren’t changed.

agostrer wrote on Wednesday, December 05, 2018:

Forgot to mention: since the code is configured as tickless, if a task is pending on a queue forever, it will be suspended, not delayed (as far as I understand the tasks.c code). And our code has such tasks.

rtel wrote on Wednesday, December 05, 2018:

Does this still happen when tickless mode is not used?

Are you using the default tickless implementation, or one tailored for
your hardware.

The usual suspect questions to ask also:

Can you update to the latest FreeRTOS kernel code and ensure
configASSERT() is defined - the newer the code the more assert points
there are to catch interrupt priority misconfiguration - the latest
version catches nearly all (if not actually all).

Does the USB driver, or any other code, touch the basepri or any other
interrupt enable/masking registers?

Do you have the priority of all interrupts set at or below
configMAX_SYSCALL_INTERRUPT_PRIORITY if they are using FreeRTOS API
functions? Sometimes people think they are, but when they switch to the
latest FreeRTOS kernel with the extra asserts they realise they are not.

Do you have stack overflow protection turned on? If so, be aware that
only checks the tasks tacks, not the interrupt stack. The interrupt
stack is the stack used by main() (which is then re-used as the
interrupt stack to recover the RAM).

agostrer wrote on Wednesday, December 05, 2018:

Hi Richard,

It is not easy to catch the problem (takes few hours/days) and we need tickless so I just didn’t have time to try without tickless. Also we adjusted the tickless to use with the L21 RTC.

As I said earlier, the trace is pointing to functions that were not changed in v10.1.1. But It is in the plan to go with the latest FreeRTOS soon.

The USB driver is from Atmel SDK. It calls its own critical sections:

#  define cpu_irq_enable()                     \
	do {                                       \
		g_interrupt_enabled = true;            \
		__DMB();                               \
		__enable_irq();                        \
	} while (0)
#  define cpu_irq_disable()                    \
	do {                                       \
		__disable_irq();                       \
		__DMB();                               \
		g_interrupt_enabled = false;           \
	} while (0)

Looks fine to me…
It is Cortex m0+…

I was playing with priorities before capturing the trace. Finally I set all priorities (including RTOS context switch to the same level). It never changed the outcome.

Yes, I turned the stack overflow protection in order to catch the problem - it didn’t show anything. I moved stacks around in the memory also - it still the same: just one pointer in the stack is corrupted.

Can you write me in private? Probably we can arrange a meeting and I can walk you through the trace?

Thank you,
Alex.

agostrer wrote on Thursday, December 06, 2018:

And here is our RTOS configuration:

#ifndef FREERTOS_CONFIG_H
#define FREERTOS_CONFIG_H

/* For documentation for all the configuration symbols, go to:
 * http://www.freertos.org/a00110.html.
 */

#if defined (__GNUC__) || defined (__ICCARM__)
/* Important: put #includes here unless they are also meant for the assembler.
 */
#include <stdint.h>
#include <shal.h>
void assert_triggered( const char * file, uint32_t line );
#endif

#define configLIST_VOLATILE                     volatile

#define configUSE_PREEMPTION                    1
#define configUSE_TICKLESS_IDLE                 2
#define configUSE_IDLE_HOOK                     0
#define configUSE_TICK_HOOK                     0
#define configPRIO_BITS                         2
#define configCPU_CLOCK_HZ                      ( PLATFORM_CPU_CLK_HZ )
#define configTICK_RATE_HZ                      ( ( TickType_t ) 1000 )
#define configMAX_PRIORITIES                    ( ( uint32_t ) 10 )	// 0 == lowestt
#define configMINIMAL_STACK_SIZE                ( ( uint16_t ) 100 )

/* configTOTAL_HEAP_SIZE is not used when heap_3.c is used. */

#define configTOTAL_HEAP_SIZE                   ( ( size_t ) ( 18000 ) ) // was 17000 
#define configMAX_TASK_NAME_LEN                 ( 8 )
#define configUSE_TRACE_FACILITY                0
#define configUSE_16_BIT_TICKS                  0
#define configIDLE_SHOULD_YIELD                 1
#define configUSE_MUTEXES                       1
#define configQUEUE_REGISTRY_SIZE               0
#ifdef BOOTLOADER
#define configCHECK_FOR_STACK_OVERFLOW          0
#else
#define configCHECK_FOR_STACK_OVERFLOW          1
#endif
#define configUSE_RECURSIVE_MUTEXES             1
#define configUSE_MALLOC_FAILED_HOOK            0
#define configUSE_COUNTING_SEMAPHORES           1
#define configUSE_QUEUE_SETS                    1
#define configGENERATE_RUN_TIME_STATS           0
#define configENABLE_BACKWARD_COMPATIBILITY     1

/* Co-routine definitions. */
#define configUSE_CO_ROUTINES                   0
#define configMAX_CO_ROUTINE_PRIORITIES         ( 2 )

/* Software timer definitions. */
#define configUSE_TIMERS                        1
#define configTIMER_TASK_PRIORITY               ( 2 )
#define configTIMER_QUEUE_LENGTH                2
#define configTIMER_TASK_STACK_DEPTH            ( 80 )

/* Set the following definitions to 1 to include the API function, or zero
to exclude the API function. */
#define INCLUDE_vTaskPrioritySet                1
#define INCLUDE_uxTaskPriorityGet               1
#define INCLUDE_vTaskDelete                     0
#define INCLUDE_vTaskSuspend                    1
#define INCLUDE_xResumeFromISR                  1
#define INCLUDE_vTaskDelayUntil                 1
#define INCLUDE_vTaskDelay                      1
#define INCLUDE_xTaskGetSchedulerState          1
#define INCLUDE_xTaskGetCurrentTaskHandle       1
#define INCLUDE_uxTaskGetStackHighWaterMark     1	// GHF modified this for DEBUG  June 18, 2015
#define INCLUDE_xTaskGetIdleTaskHandle          0
#define INCLUDE_xTimerGetTimerDaemonTaskHandle  0
#define INCLUDE_pcTaskGetTaskName               1	// Alex: test buffers; GHF testing for tick times....
#define INCLUDE_eTaskGetState                   1
#define INCLUDE_xTimerPendFunctionCall          1

/* Normal assert() semantics without relying on the provision of an assert.h
header file. */

#ifdef BOOTLOADER
#define configASSERT(x)
#else
extern void shoofASSERT ( void );
#define configASSERT(x)  \
            if( ( x ) == 0 ) { shoofASSERT(); }
#endif

/* The lowest interrupt priority that can be used in a call to a "set priority"
function. */
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY			0x0f

/* The highest interrupt priority that can be used by any interrupt service
routine that makes calls to interrupt safe FreeRTOS API functions.  DO NOT CALL
INTERRUPT SAFE FREERTOS API FUNCTIONS FROM ANY INTERRUPT THAT HAS A HIGHER
PRIORITY THAN THIS! (higher priorities are lower numeric values. */
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY	10

/* Interrupt priorities used by the kernel port layer itself.  These are generic
to all Cortex-M ports, and do not rely on any particular library functions. */
#define configKERNEL_INTERRUPT_PRIORITY 		( configLIBRARY_LOWEST_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )
#define configMAX_SYSCALL_INTERRUPT_PRIORITY 	( configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )


#if defined (__GNUC__) || defined (__ICCARM__)
#include <portmacro.h>
void vPortSuppressTicksAndSleep( TickType_t xExpectedIdleTime );
#endif
#define portSUPPRESS_TICKS_AND_SLEEP             vPortSuppressTicksAndSleep

//XXX: Temporary hook to get the malloc boundaries for different tasks
#ifndef BOOTLOADER
#define traceMALLOC(pvReturn, xWantedSize) \
    printf("\r\n Malloc start: 0x%08x; end: 0x%08x; size: 0x%04x ", (int)pvReturn, (int)pvReturn + (int)xWantedSize, (int)xWantedSize)
#endif

#endif /* FREERTOS_CONFIG_H */

rtel wrote on Thursday, December 06, 2018:

It is not easy to catch the problem (takes few hours/days) and we need
tickless so I just didn’t have time to try without tickless. Also we
adjusted the tickless to use with the L21 RTC.

Intention is to see if the issue is related to tickless or not.

The USB driver is from Atmel SDK. It calls its own critical sections:

define cpu_irq_enable() \

 do {                                       \
     g_interrupt_enabled = true;            \
     __DMB();                               \
     __enable_irq();                        \
 } while (0)

define cpu_irq_disable() \

 do {                                       \
     __disable_irq();                       \
     __DMB();                               \
     g_interrupt_enabled = false;           \
 } while (0)

This is only ok because it is globally disabling and re-enabling
interrupts (assuming the macros are used from inside something that
counts nesting, otherwise interrupts are blindly re-enabled by the
macro) and FreeRTOS only uses interrupt masking with the basepri
register - so there is no conflict with the driver and the RTOS’s
accesses to the hardware.

rtel wrote on Thursday, December 06, 2018:

Ah, re-reading your original post, I see you are using an M0, which
doesn’t have a base pri register, so the critical sections used by the
driver could indeed be a problem. If FreeRTOS is inside a critical
section when cpu_irq_enable() is called then the critical section will
be exited and you will have trouble. So:

  1. Is it possible that cpu_irq_enable() is called from inside a critical
    section? Probably not.
  2. Is cpu_irq_enable() called in a may that sets the interrupt mask back
    to whatever it was when cpu_irq_disable() is called? Or does it just
    enable interrupt regardless of whether they were originally enabled or not?

rtel wrote on Thursday, December 06, 2018:

A couple of comments on the configuration file:

  1. Set configCHECK_FOR_STACK_OVERFLOW to 2, not 1.
  2. How often is printf() called (how often is memory allocated?).
    Calling printf() is one of the most common causes of issues in a
    multithreaded environment with very small stacks.

agostrer wrote on Thursday, December 06, 2018:

I see your point. Let me check how cpu_irq_disabled/enabled are used.
Would love to try your suggestion with tickless disabling and a new RTOS but with testing it may take over a week.
Thank you,
Alex.

agostrer wrote on Friday, December 07, 2018:

Hi Richard,
So theoretically using two different critical sections may bring a problem - I defined SDK functions to use the FreeRTOS functions. It didn’t solve the problem. It started to happen even more often (probably coincedense). Do you want to look with me into the trace file? IMO, it is pointing to a hole in the OS.
Thank you,
Alex.

agostrer wrote on Friday, December 07, 2018:

We allocate memory just in init. Then the malloc() never called again. Never call free().
We do use printf for debugging. Messages are short. I can disable all of them but I doubt that it will help - I do not see any call to printf while analyzing the trace buffer.

rtel wrote on Sunday, December 09, 2018:

I can disable all of
them but I doubt that it will help -

Did it change the behaviour?

I do not see any call to printf
while analyzing the trace buffer.

Doesn’t have to be for it to have caused an issue.

agostrer wrote on Sunday, December 09, 2018:

Didn’t try it yet. On Friday I switch the USB driver into another mode (IP over USB rather than single characters). Still testing. Didn’t see an exception for 37 hours.