FreeRTOS+TCP with BufferAllocation_1.c HardFault

Hey,

I’ve been struggling for a while now with a HardFault that occurs when using FreeRTOS+TCP v2.3.2-LTS.Patch.1.

In our project we are using a Renesas RA6E1 MCU, and most of the source code used to interface with FreeRTOS+TCP at a low level originates from the Renesas FSP (version 3.6.0). The NetworkInterface.c we’re using can be found at https://github.com/renesas/fsp/tree/master/ra/fsp/src/rm_freertos_plus_tcp. For buffer management, I’ve chosen to use BufferAllocation_1.c for stability.

So this is the issue in a nutshell: when using DHCP, our device works fine, but when switched to a static IP, the device occasionally crashes into the HardFault handler after a few seconds of operation.

I have DHCP enabled with the following preprocessor macros:

#define ipconfigUSE_DHCP 1
#define ipconfigDHCP_REGISTER_HOSTNAME 0
#define ipconfigDHCP_USES_UNICAST 1
#define ipconfigDHCP_SEND_DISCOVER_AFTER_AUTO_IP 1
#define ipconfigUSE_DHCP_HOOK 1

When using a static IP, I pass the IP address to the FreeRTOS_IPInit() function and return eDHCPUseDefaults from xApplicationDHCPHook in the eDHCPPhasePreDiscover phase, as suggested in a couple of posts on these forums:

  • https://forums.freertos.org/t/freertos-tcp-dhcp-or-static-in-runtime/8445/6
  • https://forums.freertos.org/t/freertos-tcp-how-can-i-switch-dhcp-on-off-at-runtime/2733/3

Most of the time the device works just fine with static IP.

It seems that this issue is related to some sort of memory violation, and the reason we’ve only seen it happen with static IP could very well be related to the timing differences in the initialization.

Here are some of our findings so far:

The crash doesn’t seem to occur when:

  • Using DHCP
  • Removing the Ethernet cable
  • Disabling FreeRTOS+TCP init altogether
  • Disabling only the FreeRTOS+TCP RX thread

When the crash occurs:

  • The scheduler is in the process of switching tasks, and either:
  • LR points to an invalid memory area and PC points to PendSV_Handler, OR
  • PC points to an invalid memory area and LR points to our custom xTaskCallApplicationTaskHookParam while it is trying to dereference pxCurrentTCB
  • The task being switched out is the IP RX task (the implementation of the task can be found in NetworkInterface.c: https://github.com/renesas/fsp/blob/master/ra/fsp/src/rm_freertos_plus_tcp/NetworkInterface.c)

So basically, the task control block of the task we’re trying to return to has been corrupted by something.

We’ve also suspected that this issue might happen because of some network buffer mismanagement, but I haven’t been able to find any clear issues with the buffers and how they’re initialized. And it’s worthwhile to remember that most of the time everything works just fine, and we haven’t been able to reproduce this issue when DHCP is enabled.

There. Quite a long intro, and I have more data and logs regarding this issue, but perhaps this is enough for the first post. I’ll happily provide more info if needed.

Any suggestions on how to catch this issue?

Br,
Abel

After trying to trace this down with the debugger, I found out the following:

When I monitored the pxReadyTasksLists in tasks.c prior to the crash, I noticed that the pxIndex field of pxReadyTasksLists[6] was no longer pointing to an item in that list:
[Figure 1: pxReadyTasksLists[6] with an invalid pxIndex (image: 03-hard_fault-pvOwner_false)]

So when the taskSELECT_HIGHEST_PRIORITY_TASK macro is called and it updates pxCurrentTCB from the pvOwner field of the list item, assuming that it contains a pointer to a task’s TCB, the application will actually jump to an arbitrary address (0x17 in the above case) and crash into the HardFault handler.

I’m assuming at this point that, at any given time, the pvOwner of each of the uxNumberOfItems items in pxReadyTasksLists[n] should point to the beginning of the TCB of a task with priority n. This didn’t seem to be the case prior to the error shown in Figure 1. I checked the other tasks that were supposed to be in the READY state as well, and the pvOwner values were off for many of those too, as seen in Figures 2 and 3:
[Figure 2 (image: 04-hard_fault-pvOwner_false)]
[Figure 3 (image: 05-hard_fault-pvOwner_false)]

It’s worth mentioning that the task priorities in our system are as follows:

  • prvRXHandlerTask - Priority 6 (highest)
  • prvCheckLinkStatusTask - Priority 6 (highest)
  • prvIPTask - Priority 5
  • Rest of the tasks in the system have a priority 4 or lower

Thank you for the detailed investigation.

Your conclusion from the observations seems right, and it does look like memory corruption. Let’s start with the following:

  1. Define configASSERT.
  2. Enable stack overflow checking (see the “FreeRTOS - stacks and stack overflow checking” page)

For a quick check, can you try increasing the stack sizes (like 4x or something)?

Thanks for the swift reply. I actually have configASSERT defined as such:

#define configASSERT(x)       \
  if ((x) == 0)               \
  {                           \
    portDISABLE_INTERRUPTS(); \
    for (;;)                  \
      ;                       \
  }
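As a side note, a variant that records the failing location before halting can make the hang easier to diagnose once the debugger is attached. This is a sketch for FreeRTOSConfig.h; the two volatile globals are hypothetical names the application would have to define somewhere.

```c
/* FreeRTOSConfig.h - sketch: record where the assert fired before
 * halting, so the location is visible from the debugger. The globals
 * pcAssertFile / ulAssertLine are hypothetical and must be defined
 * by the application. */
extern volatile const char *pcAssertFile;
extern volatile unsigned long ulAssertLine;

#define configASSERT( x )           \
  if( ( x ) == 0 )                  \
  {                                 \
    pcAssertFile = __FILE__;        \
    ulAssertLine = __LINE__;        \
    portDISABLE_INTERRUPTS();       \
    for( ;; )                       \
      ;                             \
  }
```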

And I also have the stack overflow checking enabled with the following setting:

#define configCHECK_FOR_STACK_OVERFLOW 2
void vApplicationStackOverflowHook(TaskHandle_t xTask, char *pcTaskName)
{
	(void) xTask;
	(void) pcTaskName;
	systemFatalError("Stack Overflow");
}

Where systemFatalError() is a function that disables interrupts and prints out the error message.

However, neither of these has triggered for this issue.

Perhaps it’s worth mentioning that much of our codebase is in C++, and the tasks that use FreeRTOS+TCP are mostly wrapped in C++ objects.

I did try disabling all tasks that use the FreeRTOS+TCP stack to rule out a race condition in the initializations etc., but the crashes still occurred when only the FreeRTOS+TCP-related tasks and some supporting non-TCP/IP tasks were running.

Currently I’ve set the stack sizes for the IP-related tasks as follows:

  • prvIPTask - 5120 bytes
  • prvRXHandlerTask - 2560 bytes
  • prvCheckLinkStatusTask - 512 bytes

I tried increasing the stack sizes at some point to no avail, but I no longer remember by how much. I’ll try again, increasing the stack sizes of the previously mentioned tasks to 8 kB, 4 kB and 4 kB respectively, to confirm this. I’ll report back the results. Thanks!

The system still crashes with the stack sizes increased as follows:

  • prvIPTask - 8192 bytes
  • prvRXHandlerTask - 4096 bytes
  • prvCheckLinkStatusTask - 4096 bytes

I could try again to disable other tasks to eliminate the possibility of some random timer event etc. messing things up, although I don’t see how that would fix the issue, considering that it only occurs when using a static IP, and there isn’t much DHCP-versus-static-IP handling outside of prvIPTask.

The other possibility is an incorrectly configured interrupt priority. Can you disable the interrupts you have and see if it still happens? Can you also set configUSE_LIST_DATA_INTEGRITY_CHECK_BYTES to 1 in your FreeRTOSConfig.h?

I’ve set the configUSE_LIST_DATA_INTEGRITY_CHECK_BYTES to 1 now. Let’s see if it catches the issue earlier.

The system is quite complex, and shutting down interrupts altogether would halt it, because it’s event-driven from the get-go.

But regarding the interrupt priorities, the Ethernet driver interrupt is set to the system’s “default” interrupt priority level, meaning that it shares the same interrupt priority with many others. There are a few timer interrupts with higher priority than that of the Ethernet driver.

I could try increasing the interrupt priority of the Ethernet driver to see if that has any effect for starters.

If you try that, you should decrease the interrupt priority, not increase it.

True. I was assuming that some higher-priority interrupt might be pre-empting the Ethernet driver interrupt and that this might have been causing the issue. But perhaps it makes more sense that it would be the other way around.

So I’ll try decreasing the interrupt priority of the Ethernet driver.

Well, the system hasn’t been stable for this long with any other fix attempt so I’m very hopeful about this interrupt priority fix. Thanks a ton for the suggestion!

Hmm. While investigating why this interrupt priority change seemed to fix the issue, I came across the configMAX_SYSCALL_INTERRUPT_PRIORITY definition.

According to the FreeRTOSConfig.h customization guidelines, any interrupt that uses the interrupt-safe FreeRTOS API functions must never have an interrupt priority logically higher (more urgent) than the level set by configMAX_SYSCALL_INTERRUPT_PRIORITY. Well, this was a rule we were violating big time.

In the Ethernet driver ISR callback function, I am calling vTaskNotifyGiveFromISR() to release the RX task to handle the incoming packet. However, I had set the Ethernet driver interrupt priority to 2, whereas configMAX_SYSCALL_INTERRUPT_PRIORITY was set to 5 (zero being the highest priority), so the ISR was calling the FreeRTOS API from a priority level where that is not allowed.

This callback function is provided in the Renesas FSP codebase, and they do provide some guidelines on integrating the FreeRTOS+TCP stack with their drivers, but I think this is something they should definitely mention in their documentation on the subject: RA Flexible Software Package Documentation: FreeRTOS+TCP Wrapper to r_ether (rm_freertos_plus_tcp)

I’ll make sure that the rest of my interrupts respect this rule as well, and hopefully this was the root cause of the issue.

I will let you know if something else comes up but until then, thanks a ton and take care!


Great - that is most likely the issue. Glad that it worked for you!


Abel, this was excellent work! I followed with interest.
regards,
glen
