Semaphore Acquire leads to HardFault

kapouchima · June 22, 2024, 7:13pm

Hi,
After my program runs for some time, it suddenly stops in a HardFault. The call stack suggests that the xQueue is corrupted, but the original SemaphoreId passed in is correct and not corrupted. I don’t see how this is even possible.
The semaphore is defined statically and configASSERT is enabled.

static struct
{
  LOG_Transmitter txHandler;
  osSemaphoreId_t logTxSemId;
  char buffer[LOG_BUFFER_LEN];
} obj;

static const char* LogTypeToStr(uint8_t type)
{
  ...
}

void LOG(const char* file, uint16_t line, uint8_t type, const char* fmt, ...)
{
#if (LOG_ENABLE != 0)
  if ((obj.txHandler != NULL) && (fmt != NULL))
  {
    va_list args;
    va_start(args, fmt);
    osSemaphoreAcquire(obj.logTxSemId, osWaitForever);
    snprintf(
      obj.buffer, LOG_BUFFER_LEN, "%010ld\t[%s]\t%s:%d   ", osKernelGetTickCount(), LogTypeToStr(type), file, line);
    vsnprintf(&obj.buffer[strlen(obj.buffer)], LOG_BUFFER_LEN - strlen(obj.buffer), fmt, args);
    snprintf(&obj.buffer[strlen(obj.buffer)], LOG_BUFFER_LEN - strlen(obj.buffer), "\r\n");
    va_end(args);

    obj.txHandler(obj.buffer, strlen(obj.buffer));
  }
#endif
}

void LOG_init(LOG_Transmitter tx)
{
  obj.txHandler = tx;
  static StaticSemaphore_t semCtb;
  static osSemaphoreAttr_t logTxSemAttr = {.name = "LogTxSem", .cb_mem = &semCtb, .cb_size = sizeof(semCtb)};
  obj.logTxSemId = osSemaphoreNew(1, 1, &logTxSemAttr);
}

void LOG_txDone(void)
{
  osSemaphoreRelease(obj.logTxSemId);
}

pxQueue is 0x2, which is not a valid address, but why reading it leads to a HardFault is another strange thing.

NOTE: The MCU is a Cortex-M0, so I assume all interrupts are disabled in a critical section regardless of their priorities.

hs2 · June 23, 2024, 6:33am

Did you also enable FreeRTOS - stacks and stack overflow checking ?
Is LOG only called from tasks and not from ISRs ?
This looks like a memory corruption, maybe due to a stack overflow (or a call from an ISR).

In addition, since you need to protect the log obj, you should use an osMutex instead of a semaphore.
Also, you should avoid the excessive calls to (costly) strlen and just use the return values of the printf functions:

…
    int len = snprintf(obj.buffer, LOG_BUFFER_LEN, "%010ld\t[%s]\t%s:%d   ", osKernelGetTickCount(), LogTypeToStr(type), file, line);
    len += vsnprintf(&obj.buffer[len], LOG_BUFFER_LEN - len, fmt, args);
    len += snprintf(&obj.buffer[len], LOG_BUFFER_LEN - len, "\r\n");
    va_end(args);

    obj.txHandler(obj.buffer, len);
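
One caveat with summing the return values like this: snprintf and vsnprintf return the length the output would have had, not the length actually written. After a truncation, len can exceed LOG_BUFFER_LEN, and LOG_BUFFER_LEN - len then becomes a huge size once converted to size_t. A minimal sketch of a clamped append follows; log_append is a hypothetical helper for illustration and not part of the original code.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): appends formatted text to buf
 * starting at offset len and returns the new string length. The result is
 * clamped to what actually fits, so it is always smaller than cap when the
 * incoming len is smaller than cap. */
static size_t log_append(char *buf, size_t cap, size_t len, const char *fmt, ...)
{
  if ((cap == 0u) || (len >= cap - 1u))
  {
    return len; /* no room left for even one more character */
  }

  va_list args;
  va_start(args, fmt);
  int n = vsnprintf(&buf[len], cap - len, fmt, args);
  va_end(args);

  if (n < 0)
  {
    buf[len] = '\0'; /* formatting error: keep the text written so far */
    return len;
  }

  /* vsnprintf reports the untruncated length, so clamp it to the free space. */
  size_t room = cap - len - 1u;
  return len + (((size_t)n > room) ? room : (size_t)n);
}
```

With a helper like this, each step becomes len = log_append(obj.buffer, LOG_BUFFER_LEN, len, ...), and the final len is safe to pass to the transmit handler. Note that a truncated message would lose its trailing "\r\n" in this sketch.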

kapouchima · June 23, 2024, 8:56am

configCHECK_FOR_STACK_OVERFLOW is set to 2 and no overflow is detected. The XRTOS data also shows that every task’s all-time-low free stack space is sufficient. On the other hand, the semaphore appears to be intact in the memory snapshot I attached to the original post.

LOG is used from various places but it’s not called from an ISR. Also, the CMSIS wrapper for acquiring a semaphore handles API calls from an ISR.

hs2 · June 23, 2024, 9:22am

Note that various FreeRTOS calls and also the CMSIS wrappers like osSemaphoreAcquire provide return values indicating success or error.
You should check them to see if the desired operation was done as expected.

kapouchima · June 23, 2024, 9:30am

Sure, but in this specific case it shouldn’t lead to a crash. If the API returns an error, it would try to send data on an already busy bus and fail in the HAL layer.

hs2 · June 23, 2024, 9:55am

And it could corrupt the obj data etc. More important is that the root cause of the memory corruption probably happened somewhere else and/or earlier.
Hence you should try to write robust code, and that starts with checking (error) return values where possible, or at least doing a configASSERT.

kapouchima · June 23, 2024, 10:28am

Refactored LOG as you suggested. configASSERT is also implemented.

void LOG(const char* file, uint16_t line, uint8_t type, const char* fmt, ...)
{
#if (LOG_ENABLE != 0)
  if ((obj.txHandler != NULL) && (fmt != NULL))
  {
    va_list args;
    va_start(args, fmt);
    if (osSemaphoreAcquire(obj.logTxSemId, osWaitForever) != osOK)
    {
      goto error;
    }

    int32_t len = snprintf(
      obj.buffer, LOG_BUFFER_LEN, "%010ld\t[%s]\t%s:%d   ", osKernelGetTickCount(), LogTypeToStr(type), file, line);

    if (len < 0)
    {
      goto error;
    }

    {
      int32_t newLen = vsnprintf(&obj.buffer[len], LOG_BUFFER_LEN - len, fmt, args);
      if (newLen < 0)
      {
        goto error;
      }
      len += newLen;
    }

    {
      int32_t newLen = snprintf(&obj.buffer[len], LOG_BUFFER_LEN - len, "\r\n");
      if (newLen < 0)
      {
        goto error;
      }
      len += newLen;
    }
    va_end(args);

    obj.txHandler(obj.buffer, len);
  }
error:
#endif
}

void LOG_init(LOG_Transmitter tx)
{
  obj.txHandler = tx;
  static StaticSemaphore_t semCtb;
  static osSemaphoreAttr_t logTxSemAttr = {.name = "LogTxSem", .cb_mem = &semCtb, .cb_size = sizeof(semCtb)};
  obj.logTxSemId = osSemaphoreNew(1, 1, &logTxSemAttr);
}

void LOG_txDone(void)
{
  osSemaphoreRelease(obj.logTxSemId);
}

I’m still not sure whether the crash is due to memory corruption.

hs2 · June 23, 2024, 11:34am

And consider using the better-suited osMutex, because you want/have to mutually exclude multiple LOG callers (with priority inheritance).
I still think it is a memory corruption (in fact a corruption of the FreeRTOS internal semaphore data structure, with the symptom that the queue handle is damaged).
Beware that this probably happened somewhere else in your code. The LOG code is pretty straightforward and seems ok so far. Try to review and maybe improve your remaining application code in a similar way, and try to narrow it down by excluding code parts. Also HAL can be buggy…
It might help to set a data breakpoint on the queue handle member of the semaphore/mutex structure, if your debugger supports it. Then you could trap an unwanted access/overwrite of it.
BTW do you have dynamic memory allocation enabled ? If yes, which heap implementation is used ?

kapouchima · June 23, 2024, 1:10pm

The Tx handler uses DMA and calls LOG_txDone from an ISR. That’s why a mutex is not used.

Some USB libraries from ST use dynamic allocation, but all RTOS components in my code use static allocation. The heap1 algorithm is used.

hs2 · June 23, 2024, 1:19pm

Hopefully you don’t have a problem with the ST USB stack like STM32-RTOS-USB-HowToFix/README.md at master · gmgunderground/STM32-RTOS-USB-HowToFix · GitHub or other 3rd party code.
Good luck !

kapouchima · June 23, 2024, 9:48pm

After spending days debugging the issue, I finally found the root cause:

Steps taken:

Blaming myself for memory/stack overflows
Tried heap4, heap1 and newlib for memory allocation
Refactoring all RTOS components using static allocation
Tracing all the failed RTOS structs in memory and couldn’t find any corruption
Enabling configASSERT, stack monitor and RTOS structs integrity check with no luck

Finally, with some luck, I came across https://forums.freertos.org/t/hardfault-on-arm-cortex-m0/15423/11 and checked the errata sheet for my MCU (the same MCU as in that post), and YES! Prefetching instructions across different flash banks sometimes corrupts CPU registers/context. That’s why the struct address and contents passed to an API were fine, but inside the API r0 was 0 instead of a correct pointer. (This should count as fraud, since ST sells the chip as a 256K flash MCU and in practice you can’t have a single application bigger than 128K on it, or else you have to disable the prefetch!)

I had a bootloader at the beginning of the first bank, and the application code spilled over into the second flash bank.

Fix: linked the bootloader to bank1 and the app to bank2 and now it works reliably.
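
To illustrate the layout constraint behind this fix, here is a minimal sketch of a check that an image does not straddle the bank boundary. The flash base address, the two 128 KiB banks, and the names FLASH_BASE_ADDR, FLASH_BANK_SIZE, flash_bank_of and image_in_single_bank are assumptions for illustration, not values taken from the thread or from the errata sheet; the real numbers should come from the MCU's reference manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout (hypothetical, verify against the reference manual):
 * 256 KiB of flash starting at 0x08000000, split into two 128 KiB banks. */
#define FLASH_BASE_ADDR 0x08000000u
#define FLASH_BANK_SIZE (128u * 1024u)

/* Index of the bank containing addr: 0 for the first bank, 1 for the second. */
static uint32_t flash_bank_of(uint32_t addr)
{
  return (addr - FLASH_BASE_ADDR) / FLASH_BANK_SIZE;
}

/* True if the image [start, start + size) lies entirely within one bank. */
static bool image_in_single_bank(uint32_t start, uint32_t size)
{
  if (size == 0u)
  {
    return true; /* an empty image cannot cross a boundary */
  }
  return flash_bank_of(start) == flash_bank_of(start + size - 1u);
}
```

Under these assumed numbers, an application placed right after a 32 KiB bootloader and growing to 128 KiB would fail the check, while the same application linked at the start of the second bank would pass.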

@hs2 Thanks for your time and help

aggarg · June 24, 2024, 5:57am

Thank you for sharing your solution!