tc-cadscan wrote on Friday, January 26, 2018:
I’ve written a zero-copy network driver for FreeRTOS+TCP on STM32F4. For the most part it works well, but every once in a while pxGetNetworkBufferWithDescriptor returns NULL. I’m calling it with xBlockTimeTicks set to portMAX_DELAY, so in theory this should never happen. Looking through the source code, the buffer descriptors are kept in a list called xFreeBuffersList, and are fetched by this line:
    pxReturn = ( NetworkBufferDescriptor_t * ) listGET_OWNER_OF_HEAD_ENTRY( &xFreeBuffersList );
Following this, some validation is performed on pxReturn to make sure it is a valid descriptor, and if not, return NULL. This seems to be where it’s failing:
    if( ( bIsValidNetworkDescriptor( pxReturn ) != pdFALSE_UNSIGNED ) &&
        listIS_CONTAINED_WITHIN( &xFreeBuffersList, &( pxReturn->xBufferListItem ) ) )
So it’s basically looking up the ListItem_t associated with the network descriptor, and verifying that it is part of xFreeBuffersList. listIS_CONTAINED_WITHIN just compares the pvContainer field of the ListItem_t to the list pointer (xFreeBuffersList in this case). The problem seems to be that pvContainer is set to NULL. As this should get set when the descriptor is inserted into the list (i.e. when it was last released), and the descriptor is definitely getting inserted into the list correctly (it must be, given that we just got it from the list), this makes me think the list structure is getting corrupted in memory somehow.
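To make the failing check concrete, here is a minimal stand-alone model of it. The types and functions (`MiniList`, `MiniItem`, `mini_insert_end`, `mini_remove`, `mini_is_contained_within`) are hypothetical and far simpler than the real list.h. The sketch only assumes what is described above: the container pointer is written on insert, cleared on removal, and compared against the list by the check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for List_t / ListItem_t. */
struct MiniList;

typedef struct MiniItem {
    struct MiniItem *next;
    struct MiniItem *prev;
    void *owner;                 /* e.g. the buffer descriptor */
    struct MiniList *container;  /* plays the role of pvContainer */
} MiniItem;

typedef struct MiniList {
    size_t count;
    MiniItem end;                /* sentinel that closes the circular list */
} MiniList;

void mini_list_init(MiniList *list)
{
    list->count = 0;
    list->end.next = &list->end;
    list->end.prev = &list->end;
    list->end.owner = NULL;
    list->end.container = NULL;
}

/* Append at the tail and record which list the item now belongs to. */
void mini_insert_end(MiniList *list, MiniItem *item)
{
    item->next = &list->end;
    item->prev = list->end.prev;
    list->end.prev->next = item;
    list->end.prev = item;
    item->container = list;
    list->count++;
}

/* Unlink using the item's own next/prev pointers and clear its container. */
void mini_remove(MiniItem *item)
{
    MiniList *list = item->container;
    item->next->prev = item->prev;
    item->prev->next = item->next;
    item->container = NULL;
    if (list != NULL) {
        list->count--;
    }
}

/* First real item in the list (the sentinel itself if the list is empty). */
MiniItem *mini_head(MiniList *list)
{
    return list->end.next;
}

/* Same idea as the container comparison described above. */
int mini_is_contained_within(const MiniList *list, const MiniItem *item)
{
    return item->container == list;
}
```

In this model, inserting A, B and C, removing A and then B, and then removing A a second time through its stale next/prev pointers leaves B back at the head of the list with a NULL container, so the head entry fails the check. That is only one way to reach the state described above, not a diagnosis.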
A couple of times I’ve also seen a different issue, where listGET_OWNER_OF_HEAD_ENTRY returns some random value which isn’t a buffer descriptor. Whenever this happened, I tracked it down to fields in the list structure being invalid (sometimes NULL or sometimes random values), which also suggests memory corruption to me.
There are only 5 places in my code where I call into BufferAllocation_1.c:
- During xNetworkInterfaceInitialise, I call pxGetNetworkBufferWithDescriptor multiple times to allocate buffers to the receive DMA descriptors.
- During xNetworkInterfaceOutput, I assign the buffer passed in by the TCP/IP stack to a transmit DMA descriptor. If this fails (due to a DMA descriptor not being available within a given timeout), I call vReleaseNetworkBufferAndDescriptor to discard the buffer.
- During the receive deferred interrupt processing task, I attempt to send the received buffer to the TCP/IP stack. If this fails, I call vReleaseNetworkBufferAndDescriptor to discard the buffer.
- Also in the deferred processing task, I call pxGetNetworkBufferWithDescriptor to replace the receive buffer that was just used.
- In the transmit complete interrupt handler, I call vNetworkBufferReleaseFromISR to release the transmit buffer.
In all of these cases, I’m calling the buffer allocation functions exactly as described in the documentation, so I don’t think this is the issue. Whenever I have to deal with a buffer in an ISR, I make sure to only call the interrupt-safe functions. Checking in BufferAllocation_1.c, all of the functions that interact with the descriptor list use ipconfigBUFFER_ALLOC_LOCK and ipconfigBUFFER_ALLOC_UNLOCK (or the interrupt-safe equivalents), so I’m not sure where the problem is happening.
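One way to narrow this down could be to track the state of every descriptor and flag any get or release that does not match it. The sketch below is a stand-alone model of that idea. `BufTracker`, `tracker_on_get` and `tracker_on_release` are hypothetical names, not part of FreeRTOS+TCP. It assumes the hooks would see every get and release in the system (the driver's and the stack's), for example by being called inside the buffer allocation functions while the lock is held. Otherwise it would report false positives.

```c
#include <assert.h>
#include <stddef.h>

#define TRACKED_BUFFERS 16  /* assumption: at least the number of descriptors */

typedef enum { BUF_FREE, BUF_IN_USE } BufState;

typedef struct {
    const void *descriptor[TRACKED_BUFFERS];
    BufState state[TRACKED_BUFFERS];
    size_t used;       /* number of table slots filled so far */
    unsigned errors;   /* mismatched gets/releases seen so far */
} BufTracker;

void tracker_init(BufTracker *t)
{
    t->used = 0;
    t->errors = 0;
}

/* Find the slot for a descriptor, adding it with `initial` state if it has
 * not been seen before. Returns -1 if the table is full. */
static int tracker_slot(BufTracker *t, const void *desc, BufState initial)
{
    for (size_t i = 0; i < t->used; i++) {
        if (t->descriptor[i] == desc) {
            return (int)i;
        }
    }
    if (t->used == TRACKED_BUFFERS) {
        return -1;
    }
    t->descriptor[t->used] = desc;
    t->state[t->used] = initial;
    return (int)t->used++;
}

/* Call when a descriptor is taken from the free list.
 * Returns -1 if it was already marked as in use. */
int tracker_on_get(BufTracker *t, const void *desc)
{
    int i = tracker_slot(t, desc, BUF_FREE);
    if (i < 0) {
        return -1;
    }
    if (t->state[i] == BUF_IN_USE) {
        t->errors++;
        return -1;
    }
    t->state[i] = BUF_IN_USE;
    return 0;
}

/* Call when a descriptor is returned to the free list.
 * Returns -1 if it was already marked as free (a double release). */
int tracker_on_release(BufTracker *t, const void *desc)
{
    int i = tracker_slot(t, desc, BUF_IN_USE);
    if (i < 0) {
        return -1;
    }
    if (t->state[i] == BUF_FREE) {
        t->errors++;
        return -1;
    }
    t->state[i] = BUF_FREE;
    return 0;
}
```

A breakpoint or assert on the error path would then stop at the first mismatched call rather than at the later point where the list is found to be inconsistent.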
I’ve written a simple test application to reproduce the issue - it basically just listens for UDP packets on port 10000 and echoes them back to the sender. I’m setting the FREERTOS_ZERO_COPY flag to reduce unnecessary buffer allocation/deallocation, to rule that out. This is the only task on the system which uses the TCP stack:
    static void echoThread(void *pvParameters)
    {
        const uint16_t port = 10000;

        /* Create a UDP socket and bind it to the echo port. */
        Socket_t sock = FreeRTOS_socket(FREERTOS_AF_INET, FREERTOS_SOCK_DGRAM, FREERTOS_IPPROTO_UDP);
        CAD_ASSERT(sock != FREERTOS_INVALID_SOCKET);

        struct freertos_sockaddr hostAddr;
        hostAddr.sin_port = FreeRTOS_htons(port);
        FreeRTOS_bind(sock, &hostAddr, sizeof(hostAddr));

        struct freertos_sockaddr clientAddr;
        uint32_t clientAddrLength = sizeof(clientAddr);
        uint8_t *buffer;
        int32_t length;

        while (1)
        {
            /* Zero-copy receive: on success, buffer points at the received payload. */
            length = FreeRTOS_recvfrom(sock, &buffer, 0, FREERTOS_ZERO_COPY, &clientAddr, &clientAddrLength);
            if (length > 0)
            {
                /* Zero-copy send of the same buffer back to the sender. */
                length = FreeRTOS_sendto(sock, buffer, length, FREERTOS_ZERO_COPY, &clientAddr, clientAddrLength);
            }
            if (length <= 0)
            {
                /* Either the receive or the send did not succeed: release the buffer. */
                FreeRTOS_ReleaseUDPPayloadBuffer(buffer);
            }
        }
    }
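The buffer ownership this loop relies on can be sketched off-target. With the zero-copy flag, a successful receive hands the buffer to the application, a successful send hands it back to the stack, and a failed send leaves it with the application. `FakeStack`, `fake_recvfrom`, `fake_sendto`, `fake_release` and `echo_once` are hypothetical stand-ins written for this sketch, not FreeRTOS+TCP calls. Whether the receive can return without a buffer at all depends on the socket's receive timeout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the stack, so the logic can run on a PC. */
typedef struct {
    int32_t recv_result;   /* value the fake receive returns */
    int32_t send_result;   /* value the fake send returns */
    uint8_t payload[64];   /* pretend network buffer payload */
    int releases;          /* number of release calls made */
} FakeStack;

/* Only writes *buffer when a packet was actually received. */
int32_t fake_recvfrom(FakeStack *s, uint8_t **buffer)
{
    if (s->recv_result > 0) {
        *buffer = s->payload;   /* the caller now owns this buffer */
    }
    return s->recv_result;
}

/* A result > 0 means the stack took ownership of the buffer. */
int32_t fake_sendto(FakeStack *s, const uint8_t *buffer, int32_t length)
{
    (void)buffer;
    (void)length;
    return s->send_result;
}

void fake_release(FakeStack *s, uint8_t *buffer)
{
    (void)buffer;
    s->releases++;
}

/* One loop iteration that releases only a buffer it still owns. */
void echo_once(FakeStack *s)
{
    uint8_t *buffer = NULL;
    int32_t length = fake_recvfrom(s, &buffer);

    if (length <= 0) {
        return;                     /* nothing received, nothing to release */
    }
    if (fake_sendto(s, buffer, length) <= 0) {
        fake_release(s, buffer);    /* send failed: still ours to release */
    }
}
```

In this variant the release is reached only after a failed send, so a receive that returns without a packet never passes an unset or previously sent pointer to the release call.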
The issue only shows up very rarely - probably once in every 100,000 to a million packets. The only reliable way I’ve found to reproduce it is to flood the network interface with traffic, at a rate of several thousand packets/second. Even then it’s difficult to reproduce, sometimes taking several minutes. If it helps at all, I’ve found that it tends to show up most often just after starting to send packets - stopping and then restarting the packet generator is more likely to trigger it than just leaving it running.
I did find one other post on here (https://freertos.org/FreeRTOS_Support_Forum_Archive/August_2016/freertos_TCP_BufferAllocation_1_xFreeBuffersList_corruption_bab400aej.html) with a similar issue. In that case it was tracked down to stack overflow (identified by the 0xA5A5A5A5 sequence), but in my case I don’t see that sequence anywhere so I doubt it’s stack overflow. Also if it was stack overflow I’d expect it to happen a lot more frequently and less randomly than it is doing.
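For context, 0xA5 is the byte FreeRTOS uses to fill a task's stack at creation when stack checking or high-water-mark support is enabled, and uxTaskGetStackHighWaterMark reports how much of that fill is still intact. The idea can be sketched stand-alone as below. `fill_stack` and `count_untouched_stack_bytes` are hypothetical helpers written for this sketch, and it assumes a descending stack, as on Cortex-M, so the lowest addresses are the last to be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5u  /* fill value FreeRTOS writes into new task stacks */

/* Model of what happens at task creation: the whole stack gets the pattern. */
void fill_stack(uint8_t *stack_low, size_t size)
{
    memset(stack_low, STACK_FILL_BYTE, size);
}

/* Count bytes that still hold the fill pattern, scanning upwards from the
 * lowest address. On a descending stack this is the space never used. */
size_t count_untouched_stack_bytes(const uint8_t *stack_low, size_t size)
{
    size_t n = 0;
    while (n < size && stack_low[n] == STACK_FILL_BYTE) {
        n++;
    }
    return n;
}
```

A result of zero (or close to it) for any task would point towards a stack overflow. A healthy margin on every task makes that explanation less likely, though it does not rule out other writers to the same memory.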
At this point I’m not sure how I can debug the issue any further, as the problem seems to be happening within the TCP stack itself. Any ideas?
Thanks,
Tom