Hi,
After my program runs for some time, it suddenly stops in a HardFault. From the call stack it looks like the xQueue is corrupted, but the original SemaphoreId passed in is correct and not corrupted. I don’t see how that is even possible.
Semaphore is defined statically and configASSERT is enabled.
Did you also enable FreeRTOS - stacks and stack overflow checking ?
Is LOG only called from tasks and not from ISRs ?
Seems there is a memory corruption maybe due to stack overflow (or a call in an ISR).
In addition since you need to protect the log obj you should use an osMutex instead of a semaphore.
Also you should avoid excessive calls to (costly) strlen and just use the return value of the printf-functions :
…
int len = snprintf(obj.buffer, LOG_BUFFER_LEN, "%010lu\t[%s]\t%s:%d ", (unsigned long)osKernelGetTickCount(), LogTypeToStr(type), file, line); /* tick count is unsigned, so print it with %lu */
len += vsnprintf(&obj.buffer[len], LOG_BUFFER_LEN - len, fmt, args);
len += snprintf(&obj.buffer[len], LOG_BUFFER_LEN - len, "\r\n");
va_end(args);
obj.txHandler(obj.buffer, len);
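One caveat with summing the return values directly: snprintf and vsnprintf return the length the output would have had, so when a message gets truncated, len can end up larger than LOG_BUFFER_LEN and the next LOG_BUFFER_LEN - len goes negative. A minimal sketch of a clamped append follows. The helper names log_append and log_vappend are hypothetical and not part of the original LOG code.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the original LOG code):
 * append formatted text to buf (capacity cap) starting at offset len.
 * Returns the new string length, clamped to cap - 1 if output was truncated. */
static size_t log_vappend(char *buf, size_t cap, size_t len,
                          const char *fmt, va_list args)
{
    if (cap == 0) {
        return 0;                 /* no room at all, nothing written */
    }
    if (len >= cap - 1) {
        return cap - 1;           /* buffer already full */
    }

    int n = vsnprintf(buf + len, cap - len, fmt, args);
    if (n < 0) {
        return len;               /* encoding error: keep the old length */
    }
    if ((size_t)n >= cap - len) {
        return cap - 1;           /* truncated: string now fills the buffer */
    }
    return len + (size_t)n;
}

/* Variadic convenience wrapper around log_vappend. */
static size_t log_append(char *buf, size_t cap, size_t len,
                         const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    len = log_vappend(buf, cap, len, fmt, args);
    va_end(args);
    return len;
}
```

With helpers like these, the three formatting steps would each become len = log_append(obj.buffer, LOG_BUFFER_LEN, len, ...), or log_vappend for the fmt/args part, and the len passed to obj.txHandler can never exceed the buffer.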
The configCHECK_FOR_STACK_OVERFLOW is set to 2 and no overflow is detected. I can also see from the XRTOS data that every task’s all-time-low free stack space is sufficient. On the other hand, the semaphore looks intact in the memory snapshot I attached in the original post.
LOG is used from various places but it’s not called from an ISR. Also, the CMSIS wrapper for acquiring a semaphore handles API calls from an ISR.
Note that various FreeRTOS calls and also the CMSIS wrappers like osSemaphoreAcquire provide return values indicating success or error.
You should check them to see if the desired operation was done as expected.
Sure, but in this specific case it shouldn’t lead to a crash. If the API returns an error, it would try to send data on an already busy bus, and that would fail in the HAL layer.
And could corrupt the obj data etc. More important is that the root cause problem for the memory corruption probably happened somewhere else and/or earlier.
Hence you should try to write robust code, and that starts with checking (error) return values where possible, or at least doing a configASSERT.
Also consider using the better suited osMutex, because you want/have to mutually exclude multiple LOG callers (with priority inheritance).
I still think it is a memory corruption (in fact a corruption of the FreeRTOS internal semaphore data structure with the symptom that the queue handle is damaged).
Beware that this probably happened somewhere else in your code. The LOG code is pretty straightforward and seems ok so far. Try to review and maybe improve your remaining application code in a similar way, and try to narrow it down by excluding code parts. Also HAL can be buggy…
It might help if you can apply a data breakpoint at the queue handle member of the semaphore/mutex structure if supported by your debugger. Then you could trap an unwanted access/overwrite of it.
BTW do you have dynamic memory allocation enabled ? If yes, which heap implementation is used ?
After spending days debugging the issue, I finally found the root cause.
Steps taken:
Blamed myself for memory/stack overflows
Tried heap4, heap1 and newlib for memory allocation
Refactored all RTOS components to use static allocation
Traced all the failed RTOS structs in memory and couldn’t find any corruption
Enabled configASSERT, the stack monitor and RTOS struct integrity checks, with no luck
Finally, with some luck, came across https://forums.freertos.org/t/hardfault-on-arm-cortex-m0/15423/11 and checked the errata sheet for my MCU (same MCU as in that post), and YES! Prefetching instructions across different flash banks sometimes corrupts CPU registers/context. That’s why the struct address and contents passed to an API were fine, but inside the API r0 was 0 instead of a valid pointer. (This should count as fraud, since ST sells the chip as a 256K flash MCU and in practice you can’t have a single application bigger than 128K on it, or you have to disable the prefetch!)
I had a bootloader at the beginning of the first bank, and the application code spilled over into the second bank of the flash.
Fix: linked the bootloader to bank1 and the app to bank2 and now it works reliably.
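To catch this kind of layout mistake early, a check that an image stays inside a single flash bank could look roughly like the sketch below. It is only an illustration: the function name image_in_single_bank is hypothetical, the 128K bank size is taken from the numbers above, and the flash base address in the usage note is an assumption that has to be checked against the reference manual of the actual MCU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the image [start, start + size)
 * lies entirely inside one flash bank.
 * flash_base and bank_size are device specific (assumptions here). */
static bool image_in_single_bank(uint32_t start, uint32_t size,
                                 uint32_t flash_base, uint32_t bank_size)
{
    if (size == 0 || bank_size == 0 || start < flash_base) {
        return false;             /* invalid input, treat as not OK */
    }

    /* 64-bit arithmetic so start + size cannot wrap around. */
    uint64_t first_off = (uint64_t)start - flash_base;
    uint64_t last_off  = first_off + size - 1;

    /* Same bank index for the first and the last byte of the image. */
    return (first_off / bank_size) == (last_off / bank_size);
}
```

For example, with an assumed flash base of 0x08000000 and 128K (0x20000) banks, an application starting at 0x08008000 with a size of 0x20000 fails the check because it spills into the second bank, while the same application linked at 0x08020000 passes.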