Issue with CAN ISR not activating

Hi Everyone!

I’m having trouble with an ISR using FreeRTOS 10.2.1 on an STM32F7 chip. The ISR is straightforward: it takes an incoming CAN message and puts it onto a queue using xQueueSendFromISR(). The problem is that making any change (either adding or commenting out application code) to a number of RTOS tasks in the system causes the ISR to stop running entirely, even if those tasks are unrelated to this ISR (this includes some tasks that are suspended at the point I expect the interrupt to fire). I have another ISR that continues to work correctly throughout testing this issue (EXTI interrupts, which send data to a different queue). Here’s the list of things I’ve tested so far:

  • I have verified the bus integrity using both an external CAN-sniffer on the testing code and by reverting the changes to the last stable release.
  • I verified that the ISR does not run at all with both a debug breakpoint and using an onboard debug LED toggle
  • I ran a quick timing check with Tracealyzer to see if there were any events on the queue or something else pre-empting the ISR somehow
  • I checked uxTaskGetStackHighWaterMark() using method 2 on each of the relevant tasks (each task had at least 80 bytes of the default 256 free), as well as the overall available heap memory
  • I verified the interrupt priority was not higher than configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY - both are set to 5, and additionally I tried decreasing the ISR priority (by raising the NVIC priority value to 6)
  • I also tried changing the NVIC priority of the other working interrupt to 6 (both together with the broken ISR and separately)
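
For reference, the priority comparison in the checks above can be sketched in plain C. This is a minimal, host-compilable sketch, assuming the part implements 4 priority bits (the usual value on STM32F7) and that lower numbers mean higher urgency, as on all Cortex-M cores. The function names and the PRIO_BITS macro are hypothetical and are not part of FreeRTOS or the HAL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 implemented priority bits, as is usual on STM32F7 parts. */
#define PRIO_BITS 4u

/* Convert a "library" priority (0..15) into the raw 8-bit value that is
 * stored in the NVIC priority register and compared against BASEPRI. */
uint8_t raw_priority(uint8_t library_priority)
{
    return (uint8_t)(library_priority << (8u - PRIO_BITS));
}

/* An ISR may use the ...FromISR() API only if its priority is numerically
 * equal to or greater than the max syscall priority (i.e. logically equal
 * or lower). */
bool may_use_from_isr_api(uint8_t irq_library_priority,
                          uint8_t max_syscall_library_priority)
{
    return irq_library_priority >= max_syscall_library_priority;
}
```

With the values described above, priorities 5 and 6 both pass this check against a max syscall priority of 5, so the configuration itself looks legal. The same rule also means those interrupts are held off whenever the kernel is inside a critical section.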

The one part I am suspicious about with this ISR, compared to the other, is the way we enable it. In the current working code, we call a startCAN() function at the end of FreeRTOS_Init(). This function activates the various notifications provided in the CAN HAL, sets up the CAN filter and calls the actual HAL_CAN_Start() function. I found that moving this call to before the RTOS scheduler starts (but after the peripheral details are configured) also causes the same issue as described above, even with a taskENTER_CRITICAL() section wrapped around the function to avoid issues with changing the enabled ISRs. This might just be a red herring, but it stuck out to me as something unique about the specific problem ISR.

At this point I’m both out of my depth as a junior engineer and out of ideas on what to try next - does anyone have any thoughts on what else could be causing this and/or tests for me to try?

Thank you,
Jamie

This seems totally strange that removing an unrelated task causes an interrupt to not fire. Seems like a symptom of a problem somewhere else.

Have you tried general debugging techniques?

Thanks.

Additionally - can you view the CAN peripheral’s registers in the debugger once it has stopped executing? That might give a clue as to why it stopped (maybe a data corruption turned it off, or there is some error condition on the peripheral that needs clearing, maybe the interrupt is disabled, etc.). That doesn’t explain the correlation between changing something unrelated and this symptom, but it might give you a clue as to where to look.
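
As a rough aid for that register check, here is a small sketch that decodes a snapshot of the bxCAN registers copied out of the debugger. The bit positions follow the bxCAN register map (CAN_MSR, CAN_IER, CAN_ESR); the struct, macro and function names are hypothetical, and whether the application uses FIFO 0 or FIFO 1 is an assumption to check against the real code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the bxCAN register map. */
#define SNAP_MSR_INAK   (1u << 0)  /* initialization mode acknowledged */
#define SNAP_MSR_SLAK   (1u << 1)  /* sleep mode acknowledged */
#define SNAP_IER_FMPIE0 (1u << 1)  /* FIFO 0 message pending interrupt enable */
#define SNAP_IER_FMPIE1 (1u << 4)  /* FIFO 1 message pending interrupt enable */
#define SNAP_ESR_BOFF   (1u << 2)  /* bus-off flag */

/* Register values copied by hand from the debugger's peripheral view. */
typedef struct {
    uint32_t msr;
    uint32_t ier;
    uint32_t esr;
} can_snapshot_t;

/* The peripheral is in normal mode when it is neither initializing nor asleep. */
bool can_is_running(const can_snapshot_t *s)
{
    return (s->msr & (SNAP_MSR_INAK | SNAP_MSR_SLAK)) == 0u;
}

/* Is the "message pending" interrupt enabled for the given RX FIFO (0 or 1)? */
bool can_rx_irq_enabled(const can_snapshot_t *s, unsigned fifo)
{
    uint32_t mask = (fifo == 0u) ? SNAP_IER_FMPIE0 : SNAP_IER_FMPIE1;
    return (s->ier & mask) != 0u;
}

bool can_is_bus_off(const can_snapshot_t *s)
{
    return (s->esr & SNAP_ESR_BOFF) != 0u;
}

/* Last error code, bits 6:4 of CAN_ESR (0 means no error). */
unsigned can_last_error_code(const can_snapshot_t *s)
{
    return (unsigned)((s->esr >> 4) & 0x7u);
}
```

If all of these look healthy in both builds, the peripheral side is probably fine and the NVIC and core masking registers become the more interesting place to look.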

Hello again!

Thanks for the suggestions - I spent today trying them out and unfortunately didn’t get to a solution. Here’s what I found:

  • configASSERT was already enabled and appears set up as provided - so we aren’t having an issue with an RTOS function misfiring
  • I re-ran the stack checking I tried earlier, but this time included every task - the configuration to check for stack overflow was already set, and according to uxTaskGetStackHighWaterMark() each task had more than sufficient memory remaining - the minimum remaining words I saw returned was 80, and the rest were between 130 and 220 (of 256 allocated)
  • The malloc failed config was also already enabled and set up - doesn’t look like that is the cause either (we have the system set up to use heap 3)
  • Checking the CAN peripheral registers during operation also didn’t yield anything directly - the registers were the same in the “working” and “non-working” versions, and double checking against the datasheet shows that the bit for the RX interrupt is active in both cases.
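
To put the watermark figures above in context, here is a small sketch of the arithmetic. It assumes the values are in stack words, that a stack word is 4 bytes (as on the Cortex-M ports, where StackType_t is 32 bits wide), and that the 256 figure is the allocated depth in words. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: one stack word is 4 bytes, as on Cortex-M ports. */
#define STACK_WORD_BYTES 4u

/* Convert a watermark reported in words into bytes. */
uint32_t stack_words_to_bytes(uint32_t words)
{
    return words * STACK_WORD_BYTES;
}

/* Peak stack usage as a whole-number percentage of the allocated depth
 * (rounded down). */
uint32_t stack_peak_used_percent(uint32_t allocated_words,
                                 uint32_t watermark_words)
{
    uint32_t used = allocated_words - watermark_words;
    return (used * 100u) / allocated_words;
}
```

Under those assumptions, a watermark of 80 words out of 256 is 320 bytes of headroom, which would make a plain task stack overflow a less likely explanation.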

One thing I did find while checking the CAN registers was one instance where just adding the breakpoint to the startCAN() function mentioned above caused the system to work as expected, but I wasn’t able to reproduce that more than one time. That makes me think I have some sort of timing issue or race condition, but I couldn’t reproduce the timing (I also tried adding an arbitrary HAL_Delay() to see if that made a difference). I had the same thing happen when adding the stack value checking into the default task - it worked when I checked the 18 application defined values, but went back to failing when I added an additional stack watermark check for the default task itself.

So it seems that I’ve reached a point where certain changes on the level of a single function call change the system operation. Anyone have an idea for more tests I can try?

Thanks Again,
Jamie

Then the interrupt is probably masked in the non-working case. Is any interrupt working at that time? You can put a breakpoint at xPortSysTickHandler or xPortPendSVHandler in port.c and see if those are triggered. Also check the value of the BASEPRI register, which is used to mask some interrupts.
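
The masking rules behind that suggestion can be sketched as a small classifier that takes register values read in the debugger. This is a simplified model of the Cortex-M rules, not port code: it ignores pending and active state, and the enum and function names are hypothetical. The iser array stands for the eight NVIC ISER words, and raw_priority is the 8-bit value from the interrupt's NVIC priority register.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    IRQ_CAN_FIRE,
    IRQ_NOT_ENABLED_IN_NVIC,
    IRQ_BLOCKED_BY_PRIMASK,
    IRQ_BLOCKED_BY_BASEPRI
} irq_state_t;

/* Classify why an interrupt can or cannot be taken, given a register snapshot. */
irq_state_t classify_irq(uint32_t primask, uint32_t basepri,
                         uint8_t raw_priority,
                         const uint32_t iser[8], unsigned irq_number)
{
    /* Each ISER word holds the enable bits for 32 interrupt numbers. */
    bool enabled = ((iser[irq_number >> 5] >> (irq_number & 31u)) & 1u) != 0u;
    if (!enabled) {
        return IRQ_NOT_ENABLED_IN_NVIC;
    }
    /* PRIMASK bit 0 blocks every interrupt with a configurable priority. */
    if ((primask & 1u) != 0u) {
        return IRQ_BLOCKED_BY_PRIMASK;
    }
    /* A non-zero BASEPRI blocks priorities numerically >= its value. */
    uint32_t base = basepri & 0xFFu;
    if (base != 0u && raw_priority >= base) {
        return IRQ_BLOCKED_BY_BASEPRI;
    }
    return IRQ_CAN_FIRE;
}
```

For example, with 4 priority bits a library priority of 5 is the raw value 0x50, so a BASEPRI reading of 0x50 while a task is running outside any critical section would mean that interrupt stays blocked.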

Thanks.