Why heap corruption when using queue?

nicolas · March 27, 2020, 1:50pm

Hi. I am working with an ESP32-Wrover-DevKit using Eclipse CDT, and the ESP-IDF framework.

I am using a single queue to collect data from multiple tasks (sensor readings). A single queue receiver will output the data through a TCP socket. Since the queue item is rather large, I decided to put only a pointer to queue items, which should be fine according to the queue documentation, as long as memory is handled correctly.

This is the data structure I am using for the queue items, note the flexible array at the end of the struct:

typedef struct mb32_packet_t {
    uint16_t preamble;
    uint8_t  system_id;
    uint8_t  message_id;
    uint8_t  reserved;
    uint16_t checksum;
    uint32_t pay_len;
    uint8_t  payload[];
} __attribute__((packed)) mb32_packet_t;

The queue declaration and definition:

#define MAX_QUEUE_SEND_ITEMS (25)

QueueHandle_t sys_link_send_queue;

sys_link_send_queue = xQueueCreate(MAX_QUEUE_SEND_ITEMS, sizeof(mb32_packet_t*));

Here’s a snippet of one of the sensor reading tasks that put items to the queue:

mb32_packet_t *packet;
uint32_t pay_len = 8;                        // payload: 8 bytes
uint32_t pac_len = sizeof(*packet)+pay_len;  // header: 11 bytes
packet = malloc(pac_len);
// ... code to assign header fields
// ... code to assign payload bytes

if(xQueueSend(sys_link_send_queue, &packet, portMAX_DELAY) != pdPASS) {
    // release allocated memory in case the queue rejected the item
    free(packet);
}
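A minimal sketch of the allocation step, assuming the packed mb32_packet_t layout shown above. The helper names mb32_packet_alloc and mb32_packet_total_len are hypothetical and not part of the original code. The sketch checks the allocation result before any header field is written, and derives the header size from sizeof instead of a literal:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct mb32_packet_t {
    uint16_t preamble;
    uint8_t  system_id;
    uint8_t  message_id;
    uint8_t  reserved;
    uint16_t checksum;
    uint32_t pay_len;
    uint8_t  payload[];   /* flexible array member: adds 0 bytes to sizeof */
} __attribute__((packed)) mb32_packet_t;

/* Hypothetical helper: allocate header + payload as one zeroed block.
 * Returns NULL on size overflow or allocation failure. */
static mb32_packet_t *mb32_packet_alloc(uint32_t pay_len)
{
    if ((size_t)pay_len > SIZE_MAX - sizeof(mb32_packet_t)) {
        return NULL;                      /* header + payload would overflow */
    }
    mb32_packet_t *p = calloc(1, sizeof(*p) + pay_len);
    if (p == NULL) {
        return NULL;                      /* out of heap: nothing to enqueue */
    }
    p->pay_len = pay_len;
    return p;
}

/* Hypothetical helper: number of bytes to put on the wire for this packet. */
static size_t mb32_packet_total_len(const mb32_packet_t *p)
{
    assert(p != NULL);
    return sizeof(*p) + p->pay_len;
}
```

With such helpers, a sender would only call xQueueSend() when mb32_packet_alloc() returned a non-NULL pointer, and the receiver could pass mb32_packet_total_len(packet) to the send function.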

Here’s the snippet of the single receiver:

void sys_link_task(void *pvParameters) {
    while(1) {
        mb32_packet_t* packet;
        if(xQueueReceive(sys_link_send_queue, &packet, portMAX_DELAY) == pdPASS) {
            // put packet bytes on the TCP stream (blocking mode)
            tcp_server_send((uint8_t*)packet, packet->pay_len+11);
            // finally release the packet memory
            free(packet);
        } else {
            ESP_LOGE(TAG, "Failed to get message from queue.");
        }
    }
}

And finally this is the implementation of the tcp_server_send() function:

void tcp_server_send(uint8_t* buffer, size_t size) {
    // send() can return fewer bytes than requested; loop until everything is written.
    if(client_sock > 0) {
        int to_write = size;
        while(to_write > 0) {
            int written = send(client_sock, buffer+(size-to_write), to_write, 0);
            if(written < 0) {
                printf("Failed to send data [w=%d]: %d", written, errno);
                break;
            }
            to_write -= written;
        }
    }
}
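For comparison, here is a sketch of the same partial-write loop that reports failure to the caller and does not spin if send() makes no progress. The name send_all is hypothetical, and the POSIX headers are used so the sketch builds on a host; the includes on ESP-IDF may differ. Note that a file descriptor of 0 can be valid, so the sketch does not use `> 0` as a validity test:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: keep calling send() until all bytes were accepted
 * by the socket layer. Returns 0 on success, -1 on error or no progress. */
static int send_all(int sock, const uint8_t *buffer, size_t size)
{
    assert(buffer != NULL || size == 0);
    size_t sent = 0;
    while (sent < size) {
        ssize_t n = send(sock, buffer + sent, size - sent, 0);
        if (n < 0 && errno == EINTR) {
            continue;          /* interrupted by a signal: retry */
        }
        if (n <= 0) {
            return -1;         /* error or zero progress: let the caller decide */
        }
        sent += (size_t)n;
    }
    return 0;
}
```

A return value of 0 only means the socket layer accepted the bytes; it says nothing about when they reach the wire.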

Now with only one sensor task, everything is running fine. As soon as I put a second sensor task in action, I get heap corruption errors sooner or later. Sometimes it runs fine for some seconds, sometimes I immediately get these errors.

The error looks like this:

CORRUPT HEAP: multi_heap.c:288 detected at 0x3ffc75e8
abort() was called at PC 0x4008da2e on core 1

ELF file SHA256: c4fc5b20ae785f9a890274f05fd4fcfcada76b29ea16a9f736ceabbea34086ad

Backtrace: 0x400913e9:0x3ffc95c0 0x40091785:0x3ffc95e0 0x4008da2e:0x3ffc9600 0x4008dda5:0x3ffc9620 0x4008413d:0x3ffc9640 0x4008416d:0x3ffc9660 0x40093a71:0x3ffc9680 0x40094557:0x3ffc96a0 0x400f4946:0x3ffc96c0 0x400f4987:0x3ffc96e0 0x400f4b0d:0x3ffc9700 0x400f4e8e:0x3ffc9720 0x400f4ee5:0x3ffc9770 0x400e2e43:0x3ffc97a0 0x400e2f52:0x3ffc97d0 0x400d3f89:0x3ffc97f0 0x4000bd83:0x3ffc9810 0x4000182a:0x3ffc9830 0x400d5e9c:0x3ffc9850 0x400d608c:0x3ffc9880 0x40093cd1:0x3ffc98b0

CPU halted.

I then ran xtensa-esp32-elf-gdb and looked up the symbol at the program counter (PC):

PC 0x4008da2e -> split_if_necessary + 206 in section .iram0.text

Any idea how to solve this issue?

My thoughts:

Do I release the packet memory too early? As I understand it, the TCP socket is in blocking mode (the default setting). If it were not in blocking mode, the code would probably also fail with a single sensor task. Therefore I guess I am doing something wrong regarding the queue itself or the memory allocation/deallocation.
I also tried to use pvPortMalloc() instead of malloc() and vPortFree() instead of free(), but it made no difference: same problems.

rtel · March 27, 2020, 3:31pm

I can’t see anything wrong with the way you are using the queue, and even tried it here to be sure.

This is curious because the Espressif code would not detect a corruption of the FreeRTOS heap (the one allocated by pvPortMalloc() and freed by vPortFree()), which would make me think the corruption was not caused by the memory you are allocating in the code snippet provided.

I’m afraid I don’t know anything about the TCP stack you are using, or the way it is configured, although it is possible you are freeing it too early if the stack is sending by reference (it just points to the data all the way to the code that actually sends the data onto the wire) rather than sending by copy (in which case it would take a copy of the data immediately, so you are free to free the buffer as soon as the tcp send function returns).

nicolas · March 27, 2020, 6:29pm

Hi Richard.

> I can’t see anything wrong with the way you are using the queue, and even tried it here to be sure.

Thanks for confirming. That lets me focus on other parts of the code.

> This is curious because the Espressif code would not detect a corruption of the FreeRTOS heap (that allocated by pvPortMalloc() and vPortFree()), which would make me think the corruption was not caused by the memory you are allocating in the code snippet provided.

The error message I posted came from the build that used malloc() and free() instead of pvPortMalloc() and vPortFree(). With pvPortMalloc() and vPortFree() I now get a shorter version of the error:

CORRUPT HEAP: multi_heap.c:288 detected at 0x3ffcb894
abort() was called at PC 0x4008da2e on core 0

> I’m afraid I don’t know anything about the TCP stack you are using.

I am sorry for not having mentioned this earlier. I am using the lwIP TCP/IP stack with BSD sockets. If I understand you correctly, the heap corruption could appear because the TCP send function returns before the data has been completely flushed out of the network adapter, assuming the lwIP stack does not copy the data to its internal output buffer and is still holding only the pointer to the data I am about to release.

I have done a brief experiment by disabling the Nagle algorithm for the TCP socket:

int flag = 1;
setsockopt(server_sock, SOL_SOCKET, TCP_NODELAY, &flag, sizeof(flag));

My reasoning was that turning off the Nagle algorithm would flush the data faster and therefore leave less opportunity for the heap to get corrupted. And indeed, the heap corruption occurs noticeably less often; I am now able to run the system for several minutes.
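A note on the setsockopt() call above: TCP_NODELAY is a TCP-level option, so in the BSD sockets API it is passed with level IPPROTO_TCP rather than SOL_SOCKET. With SOL_SOCKET the call may fail or set an unrelated option, so the experiment may not have disabled Nagle at all. A minimal sketch follows; the helper name tcp_disable_nagle is hypothetical, the POSIX headers are used so it builds on a host, and whether the option belongs on the listening or the accepted socket under lwIP is not verified here:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt()). */
static int tcp_disable_nagle(int sock)
{
    assert(sock >= 0);
    int flag = 1;
    /* TCP_NODELAY lives at the IPPROTO_TCP level, not SOL_SOCKET. */
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}
```

Checking the return value shows whether the option was actually accepted by the stack.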

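Besides the heap corruption error, I occasionally also received errno 118 from the TCP send function. The Linux errno table lists 118 as “Not a XENIX named type file”, but ESP-IDF uses newlib's errno numbering, where 118 is EHOSTUNREACH (host is unreachable).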

Do you have some knowledge about lwIP that could help me?

Thanks for your valuable help!

hs2 · March 27, 2020, 7:40pm

If you don't use zero-copy (send) mode (assuming lwIP supports this configuration), the send buffer is copied into the internal socket buffer of the TCP/IP stack; you remain the owner of the buffer and can free it. But if you're using straight BSD sockets, there is no zero-copy API…
Further, assuming lwIP is not hopelessly buggy, the heap corruption might be caused by other application code or by an inappropriate heap implementation.
Regarding the latter: did you ensure/implement thread-safe heap access, e.g. by using a proper newlib with malloc hooks? See e.g. newlibAndFreeRTOS for a pretty good explanation and description.

nicolas · April 6, 2020, 9:40pm

Hi Hartmut. Thanks for your response. I needed some time to check some code fragments and to systematically reduce the program in order to catch the bug. As I understand it, lwIP does use zero copy by default, but a developer on the ESP32 forum told me that the send() function will not return before all bytes are sent out. So calling free() after send() has returned should be safe.

I studied the link you posted. To be honest, I am currently too much of a rookie to understand the background and purpose of newlib and malloc hooks, but I will definitely read more about it (although there’s a newlib configuration item in the ESP32 sdkconfig which is enabled for my program).

However, I may have “solved” the heap corruption issue by using a fixed-size array instead of the flexible array for the payload field, like so:

typedef struct mb32_packet_t {
	uint16_t preamble;
	uint8_t  system_id;
	uint8_t  message_id;
	uint8_t  reserved;
	uint16_t checksum;
	uint32_t pay_len;
	uint8_t  payload[512];
} __attribute__((packed)) mb32_packet_t;

I didn’t change anything about my malloc() and free() usage, and I am still passing pointers to the queue. Now the program runs very stably, sending hundreds of sensor values per second from different tasks.

Can you imagine that this issue was related to the flexible array construct used together with the FreeRTOS queues?

hs2 · April 7, 2020, 9:50am

BSD (socket) TCP send usually doesn’t ensure that all data is finally sent out (to the wire). Instead the data is given to the TCP stack which tries sending all data to the peer as a byte stream, receiving ACKnowledge packets from the peer, maybe doing re-transmissions in case of packet losses, etc.
It’s not as straightforward or simple as it seems.
So a normal BSD send just tells you that it could put all the requested data into its internal socket buffer, and it starts transmitting when appropriate.
True or full ‘zero-copy’ TCP send hands over the given user buffer by pointer/reference, and the user must provide some kind of ‘free_packet’ callback, which is invoked by the stack once the data is really sent out to the wire or a socket error occurred.
The term ‘zero-copy’ is also often used for network stacks where the TCP stack just ‘zero-copy’ forwards data (segments) to the ethernet driver, which finally sends the ethernet frames to the wire.
So the question is: which kind of zero-copy is configured for lwIP?
However, the true zero-copy (send) API is usually specific to the actual stack and is not standard BSD like.
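To illustrate the ownership rule behind a ‘free_packet’ callback, here is a toy sketch. Everything in it (zc_send, zc_on_transmit_complete, packet_done) is hypothetical and is not an lwIP or BSD sockets API; the "stack" is simulated by a single pending slot. The point is only that the buffer is freed in the completion callback, never right after the send call returns:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical completion callback type; NOT an lwIP or BSD sockets API. */
typedef void (*zc_done_cb)(void *buffer);

typedef struct {
    const uint8_t *data;   /* still points into the caller's buffer */
    size_t         len;
    zc_done_cb     done;   /* invoked once the stack no longer needs data */
    void          *owner;  /* pointer handed back to done() */
} zc_pending_t;

static zc_pending_t zc_slot;      /* toy stack: one outstanding buffer */
static int zc_busy = 0;
static int zc_freed_count = 0;

/* Hand the buffer to the toy stack. The caller must not free it here. */
static int zc_send(void *buffer, size_t len, zc_done_cb done)
{
    assert(buffer != NULL && done != NULL);
    if (zc_busy) {
        return -1;                /* previous buffer is still in flight */
    }
    zc_slot = (zc_pending_t){ buffer, len, done, buffer };
    zc_busy = 1;
    return 0;
}

/* Called by the toy stack when the data has left or the socket failed. */
static void zc_on_transmit_complete(void)
{
    if (!zc_busy) {
        return;
    }
    zc_busy = 0;
    zc_slot.done(zc_slot.owner);  /* ownership returns to the application */
}

/* Example callback: the only place where the packet memory is released. */
static void packet_done(void *buffer)
{
    free(buffer);
    zc_freed_count++;
}
```

With a copying send, by contrast, the application keeps ownership throughout and may free the buffer as soon as the call returns.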
I am afraid that using the fixed buffer allocation just reduced the probability of a heap corruption, because e.g. it’s very likely that only this task / code uses exactly this (large) buffer size.
Usually, allocated and later freed buffers of a certain size are stored/cached in ‘free lists’ by common heap implementations. When calling malloc, these free lists are scanned for a matching block to be recycled. Only if no free matching block is found is a new block grabbed from the system and ‘formatted’ accordingly. (simplified explanation)
Hence it might help that only the task(s) dealing with fixed-size mb32_packet_t buffers access ‘their’ corresponding heap blocks. Or the rather large mb32_packet_t buffers are now handled in a different way/code path by your multi_heap…
Last but not least also the TCP stack as a global resource must be thread-safe and configured accordingly. But that should be the default.
All in all you might have found a good workaround for an underlying bug somewhere in your application/heap/TCP stack implementation or you just reduced the probability of the corruption.
Please tell me that it’s not airplane control software you’re working on.

nicolas · April 9, 2020, 10:30am

Hi Hartmut. Thanks a lot for your great answer! That sheds a lot of light on the issue, at least for me. It helps me a lot to understand things more clearly.

Indeed I am working on airplane control software; but don’t worry, it’s for short-distance flights only. So as long as the program works stably for about 1-2 hours, everything is fine.

Just joking - actually I am working on 3 different ESP32 projects:

WiFi <-> Serial bridge for the CrazyFlie quadcopter (already completed and works great)
WiFi <-> Parallel adapter for old 16/32-bit Commodore machines (on a good way to success)
Small crawler robot involving some sensory system (here I am struggling with the issue discussed here)

I also have an ongoing discussion on the ESP32 forum about this issue. I meanwhile raised the heap memory debugging level and got some more information.

I guess that I got enough information about the questions I had regarding the FreeRTOS queues and the lwIP stack. I now need to track down the issue in my code.

Thanks a lot for your help - that means much to me!

htibosch · April 9, 2020, 10:36am

You’re attaching nice pictures of your applications, but unfortunately they’re not visible in my Firefox browser.

Google Chrome has no problems showing them.
Thanks

nicolas · April 9, 2020, 10:57am

Thanks for the information. I uploaded them on Ubuntu with Firefox. I now merged the 3 pictures to a single image and changed the file format from jpg to png. Hope that helps.