BufferAllocation_1.c xFreeBuffersList memory corruption issue

tc-cadscan wrote on Friday, January 26, 2018:

I’ve written a zero-copy network driver for FreeRTOS+TCP on STM32F4. For the most part it works well, but every once in a while I get an issue with pxGetNetworkBufferDescriptor returning NULL. I’m calling it with xBlockTimeTicks set to portMAX_DELAY, so in theory this should never happen. Looking through the source code, the buffer descriptors are kept in a list called xFreeBuffersList, and are fetched by this line:

pxReturn = ( NetworkBufferDescriptor_t * ) listGET_OWNER_OF_HEAD_ENTRY( &xFreeBuffersList );

Following this, some validation is performed on pxReturn to make sure it is a valid descriptor, and if not, return NULL. This seems to be where it’s failing:

if( ( bIsValidNetworkDescriptor( pxReturn ) != pdFALSE_UNSIGNED ) &&
					listIS_CONTAINED_WITHIN( &xFreeBuffersList, &( pxReturn->xBufferListItem ) ) )

So it’s basically looking up the ListItem_t associated with the network descriptor, and verifying that it is part of xFreeBuffersList. listIS_CONTAINED_WITHIN just compares the pvContainer field of the ListItem_t to the list pointer (xFreeBuffersList in this case). The problem seems to be that pvContainer is set to NULL. As this should get set when the descriptor is inserted into the list (i.e. when it was last released), and the descriptor is definitely getting inserted into the list correctly (it must be, given that we just got it from the list), this makes me think the list structure is getting corrupted in memory somehow.

A couple of times I’ve also seen a different issue, where listGET_OWNER_OF_HEAD_ENTRY returns some random value which isn’t a buffer descriptor. Whenever this happened, I tracked it down to fields in the list structure being invalid (sometimes NULL or sometimes random values), which also suggests memory corruption to me.

There are only 5 places in my code where I call into BufferAllocation_1.c:

  • During xNetworkInterfaceInitialise, I call pxGetNetworkBufferDescriptor multiple times to allocate buffers to the receive DMA descriptors.
  • During xNetworkInterfaceOutput, I assign the buffer passed in by the TCP/IP stack to a transmit DMA descriptor. If this fails (due to a DMA descriptor not being available within a given timeout), I call vReleaseNetworkBufferAndDescriptor to discard the buffer.
  • During the receive deferred interrupt processing task, I attempt to send the received buffer to the TCP/IP stack. If this fails, I call vReleaseNetworkBufferAndDescriptor to discard the buffer.
  • Also in the deferred processing task, I call pxGetNetworkBufferWithDescriptor to replace the receive buffer that was just used.
  • In the transmit complete interrupt handler. I call vNetworkBufferReleaseFromISR to release the transmit buffer.

In all of these cases, I’m calling the buffer allocation functions exactly as described in the documentation, so I don’t think this is the issue. Whenever I have to deal with a buffer in an ISR, I make sure to only call the interrupt-safe functions. Checking in BufferAllocation_1.c, all of the functions that interact with the descriptor list use ipconfigBUFFER_ALLOC_LOCK and ipconfigBUFFER_ALLOC_UNLOCK (or the interrupt-safe equivalents), so I’m not sure where the problem is happening.

I’ve written a simple test application to reproduce the issue - it basically just listens for UDP packets on port 10000 and echoes them back to the sender. I’m setting the FREERTOS_ZERO_COPY flag to reduce unnecessary buffer allocation/deallocation, to rule that out. This is the only task on the system which uses the TCP stack:

static void echoThread(void *pvParameters)
{
	const uint16_t port = 10000;
	
	Socket_t sock = FreeRTOS_socket(FREERTOS_AF_INET, FREERTOS_SOCK_DGRAM, FREERTOS_IPPROTO_UDP);
	CAD_ASSERT(sock != FREERTOS_INVALID_SOCKET);
	
	struct freertos_sockaddr hostAddr;
	hostAddr.sin_port = FreeRTOS_htons(port);
	FreeRTOS_bind(sock, &hostAddr, sizeof(hostAddr));
	
	struct freertos_sockaddr clientAddr;
	uint32_t clientAddrLength = sizeof(clientAddr);
	uint8_t *buffer;
	int32_t length;
	while (1)
	{
		length = FreeRTOS_recvfrom(sock, &buffer, 0, FREERTOS_ZERO_COPY, &clientAddr, &clientAddrLength);
		if (length > 0)
		{
			length = FreeRTOS_sendto(sock, buffer, length, FREERTOS_ZERO_COPY, &clientAddr, clientAddrLength);
		}
		
		if (length <= 0)
		{
			FreeRTOS_ReleaseUDPPayloadBuffer(buffer);
		}
	}
}

The issue only shows up very rarely - probably once in every 100,000 to a million packets. The only reliable way I’ve found to reproduce it is to flood the network interface with traffic, at a rate of several thousand packets/second. Even then it’s difficult to reproduce, sometimes taking several minutes. If it helps at all, I’ve found that it tends to show up most often just after starting to send packets - stopping and then restarting the packet generator is more likely to trigger it than just leaving it running.

I did find one other post on here (https://freertos.org/FreeRTOS_Support_Forum_Archive/August_2016/freertos_TCP_BufferAllocation_1_xFreeBuffersList_corruption_bab400aej.html) with a similar issue. In that case it was tracked down to stack overflow (identified by the 0xA5A5A5A5 sequence), but in my case I don’t see that sequence anywhere so I doubt it’s stack overflow. Also if it was stack overflow I’d expect it to happen a lot more frequently and less randomly than it is doing.

At this point I’m not sure how I can debug the issue any further, as the problem seems to be happening within the TCP stack itself. Any ideas?

Thanks,
Tom

rtel wrote on Monday, January 29, 2018:

Hmm, tricky to see in this as you seem to have done a good level of debugging already.

Is the driver code all yours? Or are you also relying on some STM32 library/cube functions? If you are using STM32 library functions, do any of them mess with the basepri register? That could potentially effect critical sections.

Also, how are you passing received buffers into the TCP/IP stack? Is it from an interrupt? If so, are you then trying to allocated a new buffer from the interrupt too?

Are you using a recent version of FreeRTOS with configASSERT() defined? FreeRTOS V10.0.1 would be recommended as that has asserts for just about all the interrupt priority configuration errors people make.

tc-cadscan wrote on Monday, January 29, 2018:

We’re not using any of the STM32 library code, all the driver code has been written from scratch. Nothing in my code touches BASEPRI.

We’re using FreeRTOS v9 as we started work on the project just before v10 was released and haven’t got round to upgrading yet, but the TCP/IP stack code was taken from the v10 release. configASSERT is defined but isn’t failing anywhere.

Receive buffers get passed back to the stack from a deferred processing task, not from the interrupt. The only buffer management that happens in an ISR is releasing the transmit buffers after the transmit complete interrupt. One thing I’ve just noticed - vNetworkBufferReleaseFromISR calls vListInsertEnd to put the buffer back into the free buffers list. Is this safe to call from an ISR? (It’s wrapped inside portSET_INTERRUPT_MASK_FROM_ISR/portCLEAR_INTERRUPT_MASK_FROM_ISR, but I’m not sure if this is adequate to make a non-interrupt safe function interrupt safe).

I’ll try releasing the transmit buffers from the deferred processing task instead of directly from the interrupt, and see if that makes any difference.

rtel wrote on Monday, January 29, 2018:

One thing I’ve just noticed - vNetworkBufferReleaseFromISR
calls vListInsertEnd to put the buffer back into the free buffers list.
Is this safe to call from an ISR? (It’s wrapped inside
portSET_INTERRUPT_MASK_FROM_ISR/portCLEAR_INTERRUPT_MASK_FROM_ISR, but
I’m not sure if this is adequate to make a non-interrupt safe function
interrupt safe).

That should be ok.

I’ll try releasing the transmit buffers from the deferred processing
task instead of directly from the interrupt, and see if that makes any
difference.

Let us know if that solves the problem.

heinbali01 wrote on Monday, January 29, 2018:

Hi Tom,

I’ve written a zero-copy network driver for FreeRTOS+TCP on STM32F4

That is very brave of you, but there was a zero-copy driver already. It is not in the official distribution yet, but it was published in this list. I will post it as an attachment here below (STM32Fxx.zip).

but every once in a while I get an issue with pxGetNetworkBufferDescriptor returning NULL.

I solved this problem by dropping a packet. When a packet has been received, then you need to provide a new empty network buffer. If that is impossible, I would drop the received packet, and leave the buffer in place so it will be used to receive a next package.

I’m calling it with xBlockTimeTicks set to portMAX_DELAY

Are you sure it is wise such a long time? Does this happen from prvEMACHandlerTask() ?

When the function is called from the IP-task ( e.g. from xNetworkInterfaceOutput() ), you should use a time-out of zero. If not, the IP-task will probably be waiting for the IP-task to release some network buffer.

During xNetworkInterfaceInitialise, I call pxGetNetworkBufferDescriptor multiple
times to allocate buffers to the receive DMA descriptors.

Comment : during initialisation, there must be enough network buffers. You can use configASSERT() for this. If it is triggered, the is a configuration error.

During xNetworkInterfaceOutput, I assign the buffer passed in by the TCP/IP stack
to a transmit DMA descriptor. If this fails (due to a DMA descriptor not being
available within a given timeout), I call vReleaseNetworkBufferAndDescriptor to
discard the buffer.

This function has lead to some confusion in the past.
To summarise:

If ipconfigZERO_COPY_TX_DRIVER = 1, the parameter bReleaseAfterSend is always true
If bReleaseAfterSend is true, it means that the driver will own the pxDescriptor.
So yes, if sending fails, you must release the network buffer.
If sending succeeds, the releasing of the buffer is suspended.
In my zero-copy driver, the releasing is indeed done within prvEMACHandlerTask().

I stop commenting here.

Please have a look at the attached driver. It has been tested on both STM32F4 and STM32F7.

Regards.

tc-cadscan wrote on Tuesday, January 30, 2018:

That is very brave of you, but there was a zero-copy driver already. It is not in the official distribution yet, but it was published in this list. I will post it as an attachment here below (STM32Fxx.zip).

Thanks Hein, I’ll take a look at that and see how it differs from my code. From a quick glance, it seems like the buffer management is mostly the same, but I’m doing some initial filtering in the ISR to reduce context switches from dropped packets.

I solved this problem by dropping a packet. When a packet has been received, then you need to provide a new empty network buffer. If that is impossible, I would drop the received packet, and leave the buffer in place so it will be used to receive a next package.

I don’t think this will solve my issue, which I think is down to the buffer list becoming corrupted, rather than a timeout (as noted, I’m calling pxGetNetworkBufferAndDescriptor with portMAX_DELAY). Once it gets in this state, every call to pxGetNetworkBufferAndDescriptor returns NULL immediately (i.e. not due to timeout), so without fixing the underlying issue this would just cause the receive process to drop packets forever.

Are you sure it is wise such a long time? Does this happen from prvEMACHandlerTask() ?

When the function is called from the IP-task ( e.g. from xNetworkInterfaceOutput() ), you should use a time-out of zero. If not, the IP-task will probably be waiting for the IP-task to release some network buffer.

Yes, this is only done from the receive task, never from anything called by the IP task. I don’t ever allocate buffers in xNetworkInterfaceOutput, I only use the buffer that’s passed in as a parameter.

My reason for using an infinite block time is that, once the receive buffer has been given back to the IP task, the new buffer must be allocated to the receive DMA, regardless of how long it takes. Otherwise the receive DMA will be gradually depleted of buffers. However I agree with you that it’s better to try to get the replacement buffer first, and if that fails then drop the packet and reuse the existing buffer - I’ll change my code to do that.

heinbali01 wrote on Tuesday, January 30, 2018:

Hi Tom,

but I’m doing some initial filtering in the ISR to reduce context switches from dropped packets.

Maybe you want to share some code?

I don’t think this will solve my issue, which I think is down to the buffer list becoming corrupted

I don’t know where the corruption comes from, but I doubt that there is a problem with BufferAllocation_1.c.

My reason for using an infinite block time is that, once the receive buffer has been
given back to the IP task, the new buffer must be allocated to the receive DMA,
regardless of how long it takes.

Here I do not agree. My choice was to drop the packet received ( do not forward it to the IP-task ), and leave the buffer in places.

Otherwise the receive DMA will be gradually depleted of buffers.

True, that is why I would not swap the buffer.

I’ll change my code to do that.

Yes good. But please check if it really happens and how often.

Regards, Hein