Hi-Freq IRQ causes memory corruption during context switching

rustrannik wrote on Friday, August 11, 2017:

Hello,

I have been working on a project based on the STM32F429 for quite a while now. Recently I started noticing rare system hangs.
I decided to check the utilization of system resources and enabled the following:

extern volatile uint32_t Clock_Counter;
extern void Init_TIM4(void);

#define configUSE_TRACE_FACILITY				1
#define configGENERATE_RUN_TIME_STATS			1
#define portCONFIGURE_TIMER_FOR_RUN_TIME_STATS()(Init_TIM4())
#define portGET_RUN_TIME_COUNTER_VALUE()		(Clock_Counter)

For the run-time stats I use general-purpose Timer4, which generates an interrupt at 10 kHz (as recommended) and does the following:

void TIM4_IRQHandler(void) {
	if (TIM_GetITStatus(TIM4, TIM_IT_Update)) {
		TIM_ClearITPendingBit(TIM4,TIM_IT_Update);
		++Clock_Counter;
	};//if (TIM_IT_Update)
};//TIM4_IRQHandler()

The IRQ priority was originally set to 1, which is very high; however, since the handler uses no RTOS API, it shouldn’t matter.
And that was the beginning of the disaster: the system started hanging constantly at random times (from 18 ms to 18 minutes) in random places, with random consequences (BusFault, HardFault, dropping into Default_Handler, all sorts of assert_failed() calls in different places).
I tried other priorities from 0 to F, but it had almost no effect; only the hang frequency changed slightly.
The RTOS priorities are set as follows:

#define configPRIO_BITS							4
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY			0xF
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY	5
#define configKERNEL_INTERRUPT_PRIORITY		 ( configLIBRARY_LOWEST_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )
#define configMAX_SYSCALL_INTERRUPT_PRIORITY	 ( configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )
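
As a quick check of the arithmetic, the sketch below works out the two shifted values for 4 priority bits. The helper name to_nvic_priority is made up for this illustration and is not a FreeRTOS or CMSIS function; the three constants are copied from the config above.

```c
#include <assert.h>
#include <stdint.h>

#define configPRIO_BITS                               4
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY       0xF
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY  5

/* Hypothetical helper (not part of FreeRTOS): move a library-level priority
   (0..15 with 4 priority bits) into the top configPRIO_BITS bits of the
   8-bit priority field, which is what the two config macros above do. */
uint8_t to_nvic_priority(uint8_t library_priority)
{
    /* Only values that fit in configPRIO_BITS bits are meaningful. */
    assert(library_priority < (1u << configPRIO_BITS));
    return (uint8_t)(library_priority << (8 - configPRIO_BITS));
}

/* Same values as configKERNEL_INTERRUPT_PRIORITY and
   configMAX_SYSCALL_INTERRUPT_PRIORITY in the config above. */
uint8_t kernel_priority(void)
{
    return to_nvic_priority(configLIBRARY_LOWEST_INTERRUPT_PRIORITY);
}

uint8_t max_syscall_priority(void)
{
    return to_nvic_priority(configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY);
}
```

This gives 0xF0 for the kernel priority and 0x50 for the max syscall priority, so only interrupts at library priorities 5 to 15 may use the FromISR API, and priorities 0 to 4 are never masked by the kernel.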

I updated FreeRTOS to v9.0.0 -> No effect
I removed all other functionality and tasks, leaving only 1 task with 1 semaphore and 1 timer IRQ, and it still hangs all the time. Disabling either one (the task scheduler or the IRQ) leaves the system 100% stable, but useless.

Task code:

void Task_USART2_Comm (void *par) {
	xSemaphoreGive(SemId_USART2_Tx);
	for(;;) {
		if (xSemaphoreTake(SemId_USART2_Rx,500) == pdPASS) {
			//all removed
		} else {//if (msg received successfully)
			//all removed
		};//else if (msg received successfully)
		LedTog(LED_BLUE1);
	};//inf loop
};//UART2_Comm_Task()

main() code:

int main(void) {
	Init_Clock();
	Init_LEDs();
	Init_Buttons();

	SCB->SHCSR |= SCB_SHCSR_BUSFAULTENA_Msk; // Enabling BusFault hook

	SystemInit();
	SystemCoreClockUpdate();

	NVIC_PriorityGroupConfig( NVIC_PriorityGroup_4 );


	if ((SemId_USART2_Rx	= xSemaphoreCreateCounting(16,0)) == NULL) { goto InitFailed; };
	if ((SemId_USART2_Tx	= xSemaphoreCreateBinary()) == NULL) { goto InitFailed; };

	xTaskCreate(Task_USART2_Comm,		"U2C",	8*configMINIMAL_STACK_SIZE, NULL, tskIDLE_PRIORITY + 1, &TaskHandler);
	if (TaskHandler == NULL) { goto InitFailed; };

	vTaskStartScheduler();
	// normally should never get here

InitFailed:
	for(int i=0;;) {
		if (++i >= 1000000) { i=0; LedTog(LED_RED2); };
	};//inf loop trap
};//main()

Debugging showed that the system crashes right after leaving the timer IRQ.

Any thoughts?
Thanks in advance…

rtel wrote on Saturday, August 12, 2017:

There is rather a lot of code to look through there, but a couple of points:

  1. As no RTOS calls are made in the high frequency interrupt the
    priority doesn’t matter (as you stated in your post).

  2. In some of the FreeRTOS demos there is a run time stats
    implementation that just reads the SysTick clock count value, which is
    always running anyway. It is quite complex as the count is very fast so
    needs scaling, and it counts down rather than up, but it avoids the need
    for a fast interrupt. The other thing you could do is set the timer to
    a frequency that allows you to read its count value directly, rather
    than count the interrupts it generates. You would of course need to
    take care of overflows.

  3. Are you using the ST Cube/HAL software? If so then the HAL tick and
    the RTOS tick need to share the SysTick interrupt - and as the HAL
    drivers sometimes poll the tick count value from inside an ISR (yes, I
    know) then the SysTick interrupt has to be the HIGHEST possible priority,
    rather than the lowest. Keep the PendSV interrupt at the lowest though.
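
The second option in point 2 (reading the timer count directly and taking care of overflows) could be sketched roughly as follows. This assumes a 16-bit free-running up-counter and that the function is called at least once per wrap of the counter. The function and variable names are invented for this example, and on the target the raw value would come from the timer's count register rather than being passed in by hand.

```c
#include <stdint.h>

/* State for extending a 16-bit hardware count to 32 bits.
   Not reentrant: this sketch assumes a single caller at a time. */
static uint16_t last_raw_count = 0;   /* raw count seen at the previous read */
static uint32_t extended_count = 0;   /* accumulated 32-bit run-time count   */

/* Hypothetical helper: fold a fresh raw 16-bit timer reading into the 32-bit
   total. The subtraction is truncated to 16 bits, so it wraps modulo 2^16
   and the delta stays correct as long as the timer has wrapped at most once
   (fewer than 65536 counts) since the previous call. */
uint32_t runtime_counter_update(uint16_t raw_count)
{
    uint16_t delta = (uint16_t)(raw_count - last_raw_count);

    last_raw_count = raw_count;
    extended_count += delta;
    return extended_count;
}
```

On the STM32 this might be hooked up by defining portGET_RUN_TIME_COUNTER_VALUE() as runtime_counter_update((uint16_t)TIM4->CNT), with the timer running freely and its update interrupt disabled. That wiring is an assumption about the setup, not something taken from the demos.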

Hope that helps.

rustrannik wrote on Monday, August 14, 2017:

Thank you for the advice, and sorry for the excessive code! I thought I might have made a very stupid mistake that I can’t see because I have been looking at it for months, but that a new person would immediately say “aha! there is a typo!” or something like that.

As for the stats, I measured utilization, which showed that >90% of resources are still free, so the stats themselves are no longer needed. However, I have plenty of other interrupts (3 UARTs, ADC, CAN, DMA, I2C, SPI, 1 more timer, etc.). They just fire less frequently than Timer4, so the hangs were very rare, but they still happened. I cannot have the system reset at random times, so I need to find a way to make it 100% stable.

I use the latest Standard Peripheral drivers, so Cube/HAL is not involved.

Currently I suspect the hardware, so I will replace the chip, compile the same code for another device and another platform, and post the results here.

rustrannik wrote on Monday, August 14, 2017:

I found that my 16 MHz crystal was somehow causing this instability on the chip. Everything started working flawlessly after I replaced it…

rtel wrote on Monday, August 14, 2017:

Thanks for taking the time to report back.

gezab wrote on Monday, August 14, 2017:

I am really curious how such a hw change can affect functionality in such a
way. Can you explain?

richard_damon wrote on Tuesday, August 15, 2017:

If the crystal was causing the processor to work outside its operating limits, and thus ‘make mistakes’, it is quite possible.