UART Glitch On Startup Causes Hardfault in FreeRTOS

joe · July 27, 2020, 9:41pm

Hello,

I have a tough one here. Full descriptive subtitle would read:
UART glitch (RX line goes high to low to high, creating an error frame) on startup after UART init of an STM32F4 causes a hardfault in a single-threaded FreeRTOS 7.5.3 application some time after UART communications finish.

The exact hardfault varies based on the code executing once the conditions align. I see INVPC, INVSTATE, UNDEFINSTR. UNDEFINSTR is currently the most repeatable. It occurs during the use of a looped CRC function. The hardfault occurs 200 usec after the start of UART traffic and 70 usec after the last traffic is received.

If I am reading the PC correctly, it is always on the branch call to the CRC function.

If I inline the CRC function to remove the branch call, there is no hardfault. However, there is corruption in the CRC hash.

Removing FreeRTOS, commenting out the vPortYield() and vPortDelay(), and directly calling the single task results in no hardfault during the CRC loop.

configASSERT is defined and everything appears to check out. Plenty of stack. No other fails.

The interrupts that are enabled are for the UART and a General Purpose Timer. The timer is used for timeouts.

What do you think?

Cheers,
Joe

jefftenney · July 27, 2020, 11:18pm

A few things to check – interrupt priority assignments and preemption priority bits as noted here: RTOS for ARM Cortex-M (https://www.freertos.org/RTOS-Cortex-M3-M4.html). Also good to double check the FAQ here: FreeRTOS - Open Source RTOS Kernel for small embedded systems (https://freertos.org/FAQHelp.html). Also if you can upgrade to v10 there are many more programmatic checks that might quickly identify a configuration issue. The upgrade is generally easy due to the importance FreeRTOS places on backward compatibility.

By “traffic” do you mean “glitch”? Your statement implies the UART traffic lasts only 130us, so I’m wondering if it’s a glitch or perhaps you are using very high baud rates.

By “communications” do you mean “glitch”? Is there a glitch on UART startup that causes an immediate hardfault, or is there a glitch, some proper UART communication, and then a hardfault?

Can you post your UART ISR to help us see your design a little more clearly?

joe · July 28, 2020, 6:22pm

Jeff,

Thank you for writing! I’ll respond to the quick ones here and then report back as I dig through the more complex things to check.

v10 may be useful for debugging. The legacy code lives in v7.5.2 and will likely stay there barring a major issue.

The baud rate is 2 MBaud (2,000,000).
The glitch occurs before the traffic. See below from 2.5s to 5s on the MAIN RX label for the glitch (0x00 FRAMING ERROR for the initialized UART):

The traffic is four 2-byte packets: two RX and two TX. Typically about 150 usec start to finish.

Your last option is what we see. Power on, init UART, glitch, proper UART comms, then a hardfault.

I can get you the UART ISR. We are using DMA for RX so there is not much to it. I also performed a test which moved the enabling of the NVIC for UART from before the glitch to after, which still hardfaulted: indicating the problem is not likely the ISR.

Worth noting that the US1_ErrorISR() does fire for the FRAME ERROR. It cannot be masked due to DMA being enabled. Sidebar: This appears to be a minor TRM inconsistency in the logic diagram for interrupt masks and the EIE behavior with DMA.

void USART1_IRQHandler(void)
/*
// Determines the cause of the USART1 interrupt request and calls the
// appropriate IRQ handler.
//
*/
{
  uint32_t cr1reg;
  uint32_t us1_status;

  us1_status = USART1->SR;    // read the USART's status register
  cr1reg = USART1->CR1;       // read USART1's control reg 1 for int enables
                              // the transmit interrupts are only enabled when
                              // data is placed in the tx buffer for transmission

  if(us1_status & 0x0f)       // if an error flag is set
    US1_ErrorISR(us1_status); // handle the error

  else
  {
    // transmit machine including the shift register is empty
    if(us1_status & BIT6 && (cr1reg & BIT6))
    {
      DISABLE_TCIE1;          // disable the IRQ pending bit in the usart
      Tx1_active = FALSE;     // set the transmission complete flag for the app
    }

    // transmit buffer register empty interrupt
    else if(us1_status & BIT7 && (cr1reg & BIT7))
      US1_TxISR();
  }
}

void US1_TxISR(void)
/*
// USART1 Transmitter interrupt service routine. Retrieves a char from the
// transmitter ring buffer and writes it to the transmitter data register.
//
*/
{
  // if the Tx buffer is empty flag the transmitter as inactive
  // and return
  if(us1_txbuf_out == us1_txbuf_in)
  {
    us1_txbuf_count = 0;
    DISABLE_TXIE1;            // disable the tx interrupt until needed again
    return;                   // and exit
  }

  // Tx buffer not empty, transmit the next byte
  USART1->DR = us1_txbuf[us1_txbuf_out++];
  --us1_txbuf_count;          // decrement the tx buffer counter

  // check buffer array bounds and reset if necessary
  if(us1_txbuf_out > MAX_US1BUFS)
    us1_txbuf_out = 0;
}

void US1_ErrorISR(uint32_t status)
/*
// USART1 Error interrupt service routine. Acknowledge the error and read the
// receiver data register to reset the IRQ pending bit. The offending data
// is thrown away leaving the input parser to deal with the bad and incomplete
// data sequence. The if() statements are left over debugging artifacts.
//
*/
{
static uint8_t ecnt = 0;
uint8_t rch, i;

  i = 0;

  rch = USART1->DR;

  if(status & BIT0)       // if a parity error occurred
    i = 1;

  if(status & BIT1)       // if a framing error occurred
    i = 2;

  if(status & BIT2)       // if a noise error occurred
    i = 3;

  if(status & BIT3)       // if an overrun error occurred
    i = 4;

  dterr("rch %02x, ecnt %02x, status %02x <overrun, noise, frame, parity>", rch, ecnt, status);
  // all this code does is prevent a compile warning
  if(i > 0)
    --i;

  if(rch > 0)
    --rch;
}
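One detail in US1_TxISR() that may be worth a look is the wrap test on us1_txbuf_out. The declaration of us1_txbuf is not shown in the thread, so the following is only a sketch under the assumption that the buffer holds exactly MAX_US1BUFS bytes; in that case the index has to wrap when it reaches the size, not when it exceeds it. TXBUF_SIZE and tx_next_index() are hypothetical stand-ins, not names from the posted code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MAX_US1BUFS. Assumes the array has exactly
 * this many elements (the real declaration is not shown in the thread). */
#define TXBUF_SIZE 8u

/* Return the index that follows 'idx' in a ring of TXBUF_SIZE slots.
 * Wrapping on '>=' keeps the result inside [0, TXBUF_SIZE - 1]. */
uint16_t tx_next_index(uint16_t idx)
{
    assert(idx < TXBUF_SIZE);   /* caller must pass a valid index */

    ++idx;
    if (idx >= TXBUF_SIZE)
        idx = 0;
    return idx;
}
```

If the array is instead declared with MAX_US1BUFS + 1 elements, the original '>' comparison already stays in bounds.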

Onward.

Cheers,
Joe

joe · July 28, 2020, 6:23pm

Here is the zoomed in plot of the traffic and the hardfault indicated by the MAIN TOGGLE going high:

Posted in second post due to new user restrictions.

jefftenney · July 28, 2020, 7:23pm

If removing FreeRTOS seems to solve the issue, maybe your task doesn’t have enough stack. You set the stack size for a task in the xTaskCreate() call, and you may be setting it too small. Without FreeRTOS, you’re using the main stack which as you say is probably huge.
If the break symbol (the glitch) has anything to do with the hard fault, then you may want to look closely at how it’s handled. For example, is the DMA processing it as a regular incoming byte (value of 0) in spite of your error handler attempting to dump the byte? Does your input parser and DMA logic manage that OK?
Noticed you’re not clearing the error flags.
The suggestion to use FreeRTOS v10 is helpful even if just to experiment. It will verify for you that interrupt priorities are valid and a few other things. These are the very things that can cause exactly the problems you are experiencing.

joe · July 28, 2020, 9:29pm

Jeff,

Went from 1k to 4k+ task stack size. Ozone reports we are about 120 bytes deep into the stack when it hardfaults. No change.
Technically, it’s a FRAMING ERROR due to the lack of stop bit for this guy. DMA processes it as a regular 0 byte. A more clever implementation would find a way to skip past the byte in the DMA buffer. Alas, at this point in the error ISR, the null byte is not yet in the DMA buffer.
I am not sure what else to check on the DMA chain for what it might be affecting.
It is tricky, but the error flags are getting cleared. SR is read in the calling function and DR in the error function. TRM: “The FE bit is reset by a USART_SR register read operation followed by a USART_DR register read operation.”
v10 is dropped in and ported. A few definition tweaks and it worked out of the box. No change. Still hardfaults and no new configASSERT issues to see.

Regarding interrupt priority assignments and preemption priority bits per https://www.freertos.org/RTOS-Cortex-M3-M4.html
I copied from a CORTEX-STM32F407 example FreeRTOSConfig.h from the v10 DEMOs for the priority bits making the following additions/modifications:

configPRIO_BITS = 4 // from stm32f4 CM4F example
configKERNEL_INTERRUPT_PRIORITY: was 255 (0xff), is now 240 (0xf0)
configMAX_SYSCALL_INTERRUPT_PRIORITY: was 191 (0xbf), is now 80 (0x50)

Added prior to FreeRTOS init:
NVIC_PriorityGroupConfig( NVIC_PriorityGroup_4 ); - no change

Checked that the interrupt priorities are logically lower, numerically higher than the FreeRTOS thread (4):

DMA IRQ priority 10 - NVIC_SetPriority(DMA2_Stream2_IRQn, 10);
USART IRQ priority 10 - NVIC_SetPriority(USART1_IRQn, 10);
TIM2 IRQ priority 12 - NVIC_SetPriority(TIM2_IRQn, 12);

Reading through https://freertos.org/FAQHelp.html reminded me to try this:

portENTER_CRITICAL();
tmp2 = CalcAuth(((uint32_t)crc ^ fbuf->fbufb[i]) & 0xff);
portEXIT_CRITICAL();

As expected, this guard resolves the issue. Why?

Build-wise, this function call CalcAuth() is to a library.

When I trace the hardfault, it does appear to always be at the BL branch asm or similar.

Cheers,
Joe

joe · July 28, 2020, 9:55pm

I suspect it is when the GPTimer interrupt fires at the same time as the BL CalcAuth() branch occurs.

jefftenney · July 28, 2020, 10:11pm

My mistake. I made a bad assumption about which USART you have. I see now.

This seems to be a critical change. With your DMA interrupt priority of 10 (which is 0xA0 with the shift), it is numerically lower (and thus higher priority) than your original configMAX_SYSCALL_INTERRUPT_PRIORITY (0xBF). So if your DMA ISR makes calls to the FreeRTOS API, then this change was really important.
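The arithmetic behind "10 is 0xA0 with the shift" can be checked on a host machine. This is a minimal sketch, assuming 4 implemented priority bits as on the STM32F4; raw_priority() and may_call_fromisr_api() are hypothetical helper names, not FreeRTOS functions.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a CMSIS-style priority (as passed to NVIC_SetPriority) into the
 * raw 8-bit value the hardware compares, given the number of implemented
 * priority bits (4 on STM32F4). */
uint8_t raw_priority(uint8_t cmsis_prio, uint8_t prio_bits)
{
    assert(prio_bits >= 1 && prio_bits <= 8);
    return (uint8_t)(cmsis_prio << (8u - prio_bits));
}

/* An ISR may use the FreeRTOS "FromISR" API only if its raw priority is
 * numerically greater than or equal to configMAX_SYSCALL_INTERRUPT_PRIORITY
 * (that is, logically the same or a lower priority). */
int may_call_fromisr_api(uint8_t raw_prio, uint8_t max_syscall_raw)
{
    return raw_prio >= max_syscall_raw;
}
```

With the values from this thread, priority 10 becomes 0xA0, which fails the check against the original 0xBF and passes it against the new 0x50.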

That is a surprising fix. May be worth more investigation. Could it be the change to configMAX_SYSCALL_INTERRUPT_PRIORITY was actually the fix? If not you may try to find the source code for that library function or maybe even step through it in the debugger.

joe · July 29, 2020, 10:33pm

Jeff,

None of the ISRs call FreeRTOS API as far as I can see.
I don’t see anything significantly different with that change in configMAX_SYSCALL_INTERRUPT_PRIORITY.
The guard prevents interrupts such as the GPTimer. If the interrupt is prevented in the area where the hardfault occurs, then it would make sense that this would work. The hardfault occurs with FreeRTOS after the UART RX and after the firing of the GPTimer ISR.

Based on that, I bumped up the GPTimer rate to hardfault closer to the initial insult.

Playing with the baremetal NVIC priority and FreeRTOS priorities, I see that FreeRTOS cannot seem to run at a higher priority than the baremetal ISR. I set the GPTimer to NVIC 15 (0xf), the lowest priority, and it still fires, regardless of my FreeRTOSConfig.h.

Trying to close in on exactly what’s necessary for the insult, I bisected code around the TX/RX of the packets. What I learned:

Only one exchange is necessary
UART RX must be enabled during the glitch
The code for performing the TX/RX is very specific to the hardfault insult - e.g. changing the -O optimization level for the function to O2 or higher seems to remove the hardfault condition; or inlining the code to the calling function also relieves the hardfault condition.

In other tests I learned:

Reading the RX buffer after the FRAMING ERROR just before the TX/RX packet exchange also stops the hardfault from occurring.
GPTimer NVIC priority does not seem to matter. 1 or 15. Both fire and eventually cause the hardfault. I suspect the low priority (15) interrupts during the IDLE or Kernel execution moments, since they are the lowest priority in FreeRTOS.
The most common hardfault is now “Usage fault: INVSTATE: Invalid combination of EPSR and instruction, such as calling a null pointer function”
Timeline. Big picture:

(image omitted)

(to be continued)

joe · July 29, 2020, 10:33pm

Zoomed in to the hardfault, I see GPT return 120 ns before next gpio toggle high just before the call to CalcAuth. Prior interrupts with 200 ns or greater before the next gpio toggle do not hardfault. The hardfault with 120 ns:

(to be continued)

joe · July 29, 2020, 10:34pm

Non hardfault with 200 ns:

It can also hardfault just prior to the return of CalcAuth (image omitted).

Perhaps this is the time during which the stack is getting manipulated for a context switch and it is getting interrupted.

How can I dig deeper into that on the FreeRTOS side of things?

Cheers,
Joe

jefftenney · July 30, 2020, 1:44am

This would be a rare design in my experience. Recommend reading Richard’s book here especially the preface and chapter 6.

I’m not sure I agree. Interrupts and ISRs should not contribute to hardfaults any more than non-interrupt conditions and code.

I think you mean that no matter a FreeRTOS task’s priority, any interrupt and its associated ISR always interrupts the task. That is true. Task priorities are completely different from interrupt priorities. Chapter 3 of Richard’s book will help there.

I assume you are not talking about the USART’s RX buffer here – your code already reads that buffer after the framing error and before the TX/RX packet exchange. Instead I assume you are talking about an RX queue in software. If you pull the errant 0 byte out of the software queue before the parser consumes it, and if that solves your problem, that is an excellent clue.

No, I don’t think so. FreeRTOS protects itself against such things by design.

Regarding your logic analyzer images, can you tell us what the edges mean? Are you toggling per function execution? Are you pulsing during function execution? etc.

Also what can you tell us about CalcAuth()? Do you have source code for it? Does it utilize hardware assist (like a CRC or AES module or similar)?

Finally, you may need to analyze the hard fault using guidance here:

joe · July 30, 2020, 7:39pm

Jeff,

Below is the GPTimer ISR, where h4l() and h4h() are the gpio port h pin 4 toggles for the GPTimer trace in above logic analyzer figures. No FreeRTOS API references.

void TIM2_IRQHandler(void)
/*
// TIMER2 ISR
//
// Manages and updates the 8 system software timers. If a system timer has not
// reached zero it is decremented.
//
*/
{
uint8_t i;
  h4l();
  h4h();
  // update the 8 system software timers
  for(i = 0; i <= MAX_TIMERS; i++)
  {
    if(Timers[i] > 0)         // if this timer is still counting down
    {
      --Timers[i];            // decrement this timer
      --*Tstatus[i];          // and this timer's status variable

      if(Timers[i] == 0)      // if this timer has reached zero
        Tstatus[i] = 0;       // NULL this timer`s status pointer
    }
  }
  ++tick;
  TIM2->SR &= ~BIT0;          // reset TIM2 IRQ pending status
  h4l();
}
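Two details of the ISR above may be worth ruling out as sources of memory corruption: the loop runs with i <= MAX_TIMERS although the comment speaks of 8 timers, and Tstatus[i] is dereferenced without a NULL check even though the ISR itself sets it to NULL on expiry. Whether either is a real problem depends on declarations and reload code not shown in the thread. The following is only a host-side sketch of the same countdown logic with both guarded; NUM_TIMERS, sw_timers, sw_status and sw_timers_tick() are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_TIMERS 8u   /* hypothetical: the exact number of array elements */

volatile uint16_t sw_timers[NUM_TIMERS];    /* stand-in for Timers[]  */
volatile uint16_t *sw_status[NUM_TIMERS];   /* stand-in for Tstatus[] */

/* Body of the periodic tick: decrement every running timer and its
 * optional status variable, touching only indices 0..NUM_TIMERS-1. */
void sw_timers_tick(void)
{
    for (size_t i = 0; i < NUM_TIMERS; i++) {
        if (sw_timers[i] == 0)
            continue;                   /* timer is idle */

        --sw_timers[i];
        if (sw_status[i] != NULL)       /* skip timers with no status variable */
            --*sw_status[i];

        if (sw_timers[i] == 0)
            sw_status[i] = NULL;        /* timer expired: detach its status */
    }
}
```

If Timers[] and Tstatus[] are declared with MAX_TIMERS + 1 elements and a status pointer is always installed before a timer is loaded, the original loop is already safe.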

Okay. Let’s test the hypothesis of entry/exit stack issue. If we guard the entire call/return to CalcAuth(), we observe no hardfault. Now the next test might be to guard the entry or return instead of the entire function call and return. If it protects from the hardfault, then we are most likely getting a stack corruption from the interrupt somehow. This was also a speculated hypothesis from the ST FAE, so it is worth consideration.

Thank you for the reference. I am starting to grok the priorities. Though when I set the GPTimer to NVIC priority 1 (highest without going unmaskable), it runs all the time and on time. When it is 0xf, it does get blocked for a time and then eventually runs, though the timing is inconsistent. With a single FreeRTOS thread it seems odd. Not sure what to make of it given that it should “always” fire.

“I assume you are not talking about the USART’s RX buffer here – your code already reads that buffer after the framing error and before the TX/RX packet exchange. Instead I assume you are talking about an RX queue in software. If you pull the errant 0 byte out of the software queue before the parser consumes it, and if that solves your problem, that is an excellent clue.”

You are correct, I am talking about the circular buffer that DMA uses, not the hardware 1 byte buffer.

Notably, the compiler output for the C code of the function that handles those first two packets (and the null first byte) requires -O1 or lower to hardfault. -O2 and we see no hardfault. I’ll post the listings as screenshots of Ozone following this reply to see the differences which may be related to the hardfault.

My logic analyzer toggles for PH3 CalcAuth were supposed to be before/after the call/return of CalcAuth(). I discovered they were also before and after the calling function of CalcAuth(), so it could be either. Going high is the call, going low is the return.

CalcAuth() does not use hardware assist. It looks like this:

uint32_t CalcAuth(uint8_t dat)
/*
// Calculates an ongoing CRC on the byte sent in 'dat' and returns the
// calculated value to the caller. Each successive byte sent to this
// function is calculated into CRC of the previous byte stream.
//
*/
{
int i;
unsigned long crc;

  crc = dat;

  ++afbcount;

  if((afbcount % nth_BYTE) == 0)
    dat &= 0xfd;

  for(i = 8; i > 0; i--)
  {
    if(crc & 1)
      crc = ((crc >> 1) ^ CRC_POLYNOMIAL);

    else
      crc >>= 1;
  }

  return crc;
}
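For readers who want to exercise the bit loop off-target, here is a minimal host-side sketch of the same per-byte step plus a caller that folds it into a running CRC. It rests on assumptions: the value of CRC_POLYNOMIAL is not given in the thread, so the standard reflected CRC-32 polynomial is used, and the caller's combine step, initial value and final XOR are the conventional CRC-32 ones rather than anything confirmed by the posted code. The afbcount/nth_BYTE handling is left out, and crc_byte_step() and crc32_buffer() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the standard reflected CRC-32 polynomial. The thread does
 * not give the value of CRC_POLYNOMIAL. */
#define CRC32_POLY_REFLECTED 0xEDB88320u

/* Same bit loop as CalcAuth(), without the afbcount/nth_BYTE handling:
 * run one byte through eight shift/XOR rounds. */
uint32_t crc_byte_step(uint8_t dat)
{
    uint32_t crc = dat;

    for (int i = 8; i > 0; i--) {
        if (crc & 1u)
            crc = (crc >> 1) ^ CRC32_POLY_REFLECTED;
        else
            crc >>= 1;
    }
    return crc;
}

/* Hypothetical caller: fold each byte of 'buf' into a running CRC. */
uint32_t crc32_buffer(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;         /* conventional initial value */

    for (size_t i = 0; i < len; i++) {
        uint32_t tmp = crc_byte_step((uint8_t)((crc ^ buf[i]) & 0xFFu));
        crc = (crc >> 8) ^ tmp;
    }
    return crc ^ 0xFFFFFFFFu;           /* conventional final XOR */
}
```

One observation on the posted CalcAuth(): dat is masked with 0xfd after it has already been copied into crc, so the mask does not change the returned value. Whether that is intended is not clear from the thread.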

I’ll review the hardfault FreeRTOS guidance and get back later.

Cheers,
Joe

EDIT: added rolling count code

joe · July 30, 2020, 7:42pm

Removing my SEGGER_RTT_printf debug call results in no hardfault. Below is the code and assembly listing for the simplified function which performs the first packet exchange. -Og optimization and no hardfault:

joe · July 30, 2020, 7:43pm

Adding the debug output call inside a conditional block that is never executed brings the hardfault back. -Og optimization, with RTT, hardfault observed:

joe · July 30, 2020, 7:51pm

Change from -Og to -O2 optimization, no hardfault observed. Here is the code and listing:

Note changes:

CMP BLS conditional is moved up in sequence,
order within the if clause of MOVS MOV is reversed to MOV MOVS,
the call to UpdSendShortCmd() goes from BL B to B.W NOP

Common to the no hardfault listings:

CMP BLS are the first two instructions

Cheers,
Joe

joe · July 30, 2020, 8:43pm

When both the function entry and exit of CalcAuth() are guarded with NVIC_DisableIRQ(TIM2_IRQn) ... NVIC_EnableIRQ(TIM2_IRQn), no hardfault is observed.

Guarding only the entry, only the exit, or neither results in the hardfault:
Usage fault: INVPC: Attempt to load EXC_RETURN into pc illegally

At this point, it always crashes when the GPTimer interrupt fires once CalcAuth() has started getting called.
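The fault names reported in this thread (UNDEFINSTR, INVSTATE, INVPC) are UsageFault status bits, which sit in the upper half of the Cortex-M4 Configurable Fault Status Register. As a minimal sketch, a CFSR value captured in the fault handler (on the target it would be read from SCB->CFSR) can be decoded as below; describe_usage_faults() and the table are hypothetical helpers, and only the bit positions come from the architecture.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fault_bit {
    uint32_t mask;
    const char *name;
};

/* UsageFault bits as they appear in the 32-bit CFSR (UFSR is CFSR[31:16]). */
static const struct fault_bit usage_fault_bits[] = {
    { 1u << 16, "UNDEFINSTR" },
    { 1u << 17, "INVSTATE"   },
    { 1u << 18, "INVPC"      },
    { 1u << 19, "NOCP"       },
    { 1u << 24, "UNALIGNED"  },
    { 1u << 25, "DIVBYZERO"  },
};

/* Write the names of all usage-fault bits set in 'cfsr' into 'out',
 * separated by spaces. Returns the number of names written; names that
 * do not fit in 'out' are left off. */
int describe_usage_faults(uint32_t cfsr, char *out, size_t out_len)
{
    int count = 0;
    size_t used = 0;

    if (out_len == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof usage_fault_bits / sizeof usage_fault_bits[0]; i++) {
        if (!(cfsr & usage_fault_bits[i].mask))
            continue;

        size_t len = strlen(usage_fault_bits[i].name);
        size_t need = len + (count ? 1u : 0u);
        if (used + need + 1u > out_len)
            break;                      /* no room left, stop here */

        if (count)
            out[used++] = ' ';
        memcpy(out + used, usage_fault_bits[i].name, len);
        used += len;
        out[used] = '\0';
        ++count;
    }
    return count;
}
```

Logging the decoded names together with the stacked PC and LR on each fault would show whether the fault type really varies with where the timer interrupt lands.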

jefftenney · July 30, 2020, 9:53pm

This does make sense. The CPU delays executing an ISR if the task code (or kernel code) has masked interrupts temporarily. The CPU also delays executing an ISR if a higher priority ISR happens to be executing. So the delays (often called ISR latency) you are seeing are probably normal – I assume the delays are relatively short.
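The difference Joe saw between GPTimer priority 1 (always on time) and priority 15 (sometimes held off) fits how masking via BASEPRI behaves. The sketch below models that on a host; it assumes 4 priority bits and that critical sections raise BASEPRI to configMAX_SYSCALL_INTERRUPT_PRIORITY, as the Cortex-M ports do. The function names are hypothetical.

```c
#include <stdint.h>

/* BASEPRI semantics on Cortex-M3/M4: a non-zero BASEPRI blocks every
 * interrupt whose raw priority value is numerically >= BASEPRI.
 * BASEPRI == 0 disables the masking. */
int is_deferred_by_basepri(uint8_t raw_prio, uint8_t basepri)
{
    return basepri != 0 && raw_prio >= basepri;
}

/* Models "is this interrupt held off while a task or the kernel is inside
 * a critical section?", assuming the critical section sets BASEPRI to
 * configMAX_SYSCALL_INTERRUPT_PRIORITY (passed here as max_syscall_raw). */
int is_deferred_in_critical_section(uint8_t cmsis_prio, uint8_t prio_bits,
                                    uint8_t max_syscall_raw)
{
    uint8_t raw = (uint8_t)(cmsis_prio << (8u - prio_bits));
    return is_deferred_by_basepri(raw, max_syscall_raw);
}
```

With configMAX_SYSCALL_INTERRUPT_PRIORITY at 0x50, priorities 12 and 15 are deferred inside a critical section while priority 1 is never masked, which would also explain why the portENTER_CRITICAL() guard keeps TIM2 away from the CalcAuth() call.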

Agree that you have some evidence of strange context corruption, but if FreeRTOS is contributing to the problem here, I’d be surprised. A “common” culprit of such things is if FreeRTOS’s interrupt vectors aren’t installed correctly. I think this issue is on the FAQ too though, with 3 #define statements in your FreeRTOSConfig.h file typically resolving the issue. On a “legacy” system like yours I guess I wouldn’t expect this kind of error. Worth a look though. (How many tasks do you have?)

Another suggestion for you. Most of the stimulus you have discovered to eliminate the hard faults also results in fewer calls to the RTT output or at least in changes to the timing of those calls. RTT code is not thread safe – not safe for reentrant calls – unless you make it safe. For example, if the RTT system is in the middle of sending something to the terminal and then the framing error occurs, your current code reentrantly calls the RTT system. That could easily cause memory corruption.

If I were you, I would eliminate all calls to the RTT system and then try everything (else) you’ve got to make a hard fault happen. At least you can disprove the RTT theory that way.

If it’s the RTT system corrupting memory, then perhaps the memory it corrupts has more significant consequences when FreeRTOS is used versus when it is not.
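One way to make a non-reentrant output routine safe against the re-entry described above is a busy flag that drops nested calls. This is a minimal sketch using C11 atomics, not SEGGER's own mechanism; guarded_log(), the sink type and the demo sinks are all hypothetical, and the re-entering sink only simulates an interrupt arriving mid-write.

```c
#include <stdatomic.h>

/* Hypothetical sink: stands in for a non-reentrant output routine. */
typedef void (*log_sink_fn)(const char *msg);

static atomic_flag log_busy = ATOMIC_FLAG_INIT;
static atomic_uint log_dropped;     /* messages discarded due to re-entry */

/* Forward 'msg' to 'sink' unless a call is already in progress, in which
 * case the message is dropped. Returns 1 if the message was sent. */
int guarded_log(log_sink_fn sink, const char *msg)
{
    if (atomic_flag_test_and_set(&log_busy)) {
        atomic_fetch_add(&log_dropped, 1u);   /* re-entered: drop it */
        return 0;
    }
    sink(msg);
    atomic_flag_clear(&log_busy);
    return 1;
}

unsigned guarded_log_dropped(void)
{
    return atomic_load(&log_dropped);
}

/* Demo sinks. The second one simulates an ISR firing in the middle of a
 * write by calling guarded_log() again from inside the sink. */
int nested_result = -1;

void noop_sink(const char *msg)
{
    (void)msg;
}

void reentering_sink(const char *msg)
{
    (void)msg;
    nested_result = guarded_log(noop_sink, "from ISR");
}
```

Dropping the nested message loses debug output but never lets two writers touch the buffer at once; masking interrupts around the output call is the alternative when every message matters.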

joe · July 30, 2020, 10:51pm

I entirely removed the RTT dependency and the hardfault still repeats. I replaced it with do {} while (0), which gets optimized out even at -Og.

The asm listing is now the same as the previous listing that had passed without hardfault. Perhaps suggesting an alignment issue.

Checking the FAQ and the v10 demo code on interrupt vectors. I am currently testing within v10 because of the greater tests and checks available (though I may not yet have enabled them all).

joe · July 30, 2020, 11:01pm

Checked the interrupt vectors are installed. SVC_Handler, PendSV_Handler, SysTick_Handler are in the listing as the FreeRTOS implementation.

In this troubleshooting branch, only one task is running (plus IDLE of course).

Cheers,
Joe