Assert failed: xMBMasterRTUReceiveFSM mbrtu_m.c:235 with Modbus Master from ESP IDF

I run after some time of operation (~5-15 Minutes) into the problem, that my ESP32 restarts with the following error.

assert failed: xMBMasterRTUReceiveFSM mbrtu_m.c:235 (( eSndState == STATE_M_TX_IDLE ) || ( eSndState == STATE_M_TX_XFWR ))

The Modbus master code is pretty much the same as the example which is included in the ESP IDF, i just changed a few minor things.

This is the serial log:

I (1911922) count_task: RAM left 123264; mb_master_hwm: 1508, count_task_hwm: 20, mqtt_ssl_main_hwm: 6276
I (1912262) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1912762) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1913262) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1913762) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1914262) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1914762) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1914922) count_task: RAM left 123640; mb_master_hwm: 1508, count_task_hwm: 20, mqtt_ssl_main_hwm: 6276
I (1915262) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1915762) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1916262) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0
I (1916762) MASTER_TEST: Characteristic #0 Data_channel_0 (kW) value = 4531.000000 (0x458d9800) read successful. Restarted: 0

assert failed: xMBMasterRTUReceiveFSM mbrtu_m.c:235 (( eSndState == STATE_M_TX_IDLE ) || ( eSndState == STATE_M_TX_XFWR ))


Backtrace:0x40081c02:0x3ffd00700x40089559:0x3ffd0090 0x400909b5:0x3ffd00b0 0x400e0751:0x3ffd01d0 0x400e1545:0x3ffd0200 0x400e15b4:0x3ffd0230 0x4008cdad:0x3ffd0260
0x40081c02: panic_abort at C:/Users/Michael.Uray/esp/esp-idf/components/esp_system/panic.c:402

0x40089559: esp_system_abort at C:/Users/Michael.Uray/esp/esp-idf/components/esp_system/esp_system.c:128

0x400909b5: __assert_func at C:/Users/Michael.Uray/esp/esp-idf/components/newlib/assert.c:85

0x400e0751: xMBMasterRTUReceiveFSM at C:/Users/Michael.Uray/esp/esp-idf/components/freemodbus/modbus/rtu/mbrtu_m.c:235 (discriminator 2)

0x400e1545: usMBMasterPortSerialRxPoll at C:/Users/Michael.Uray/esp/esp-idf/components/freemodbus/port/portserial_m.c:94

0x400e15b4: vUartTask at C:/Users/Michael.Uray/esp/esp-idf/components/freemodbus/port/portserial_m.c:142

0x4008cdad: vPortTaskWrapper at C:/Users/Michael.Uray/esp/esp-idf/components/freertos/port/xtensa/port.c:131

ELF file SHA256: 375376b08aba9e70

Rebooting...
ets Jul 29 2019 12:21:46

My application runs a few tasks in parallel if that matters, which is a websocket/webserver, modbus client, modbus server and mqtt.


void app_main()
{
  const static char *TAG = "main";

  esp_err_t err = nvs_flash_init();
  if (err == ESP_ERR_NVS_NO_FREE_PAGES || err == ESP_ERR_NVS_NEW_VERSION_FOUND)
  {
    // NVS partition was truncated and needs to be erased
    // Retry nvs_flash_init
    ESP_ERROR_CHECK(nvs_flash_erase());
    err = nvs_flash_init();
  }
  ESP_ERROR_CHECK(err);

  wifi_app_main();

  led_setup();
  ws_server_start();

  xTaskCreate(&nvs_main, "nvs_main", 2000, NULL, 9, NULL);
  xTaskCreate(&server_task, "server_task", 2000, NULL, 9, NULL);
  xTaskCreate(&server_handle_task, "server_handle_task", 4000, NULL, 6, NULL);
  xTaskCreate(&count_task, "count_task", 2000, NULL, 2, NULL);
  xTaskCreate(&mb_master_main, "mb_master_main", 3500, NULL, 9, &mb_master_main_th);
  xTaskCreate(&mb_slave_main, "mb_slave_main", 3500, NULL, 9, NULL);
  xTaskCreate(&mqtt_ssl_main, "mqtt_ssl_main", 1024 * 8, NULL, 9, &mqtt_ssl_main_th);
}

What could cause this problem or how could I trace it back?

You are asking a question about modbus code that isn’t supplied by us, and it’s doubtful anybody here will be familiar with it, so we can only offer generic advice. First thing is if this is the line that is failing an assertion:

esp_system_abort at C:/Users/Michael.Uray/esp/esp-idf/components/esp_system/esp_system.c:128

What is the test that is failing (what is on that line), and is there a comment in the code there that gives you probable causes?

I thougt first this message was thrown by the FreeRTOS kernel, but this was a wrong understanding from my side.

I found in the meanwhile out, that it got triggered by the Modbus library as you mentioned:
https://github.com/espressif/esp-modbus/blob/master/freemodbus/modbus/rtu/mbrtu_m.c#L250

My guess is now, that the Modbus master and the Modbus client, which run in separate tasks, access the same resources in this library and this might lead to that problem.

After long investigations I found out, that it helped to increase the stack from the count_task to prevent this problem in the mb_master task.

In general, the count_task had about 160 bytes (ESP IDF) left according to the high watermark function and for that reason I was not worried to not have enough stack configured for it.

Before:

  xTaskCreate(&nvs_main, "nvs_main", 2000, NULL, 9, &nvs_main_th);
  xTaskCreate(&server_task, "server_task", 2000, NULL, 9, &server_task_th);
  xTaskCreate(&server_handle_task, "server_handle_task", 4000, NULL, 6, &server_handle_task_th);
  xTaskCreate(&count_task, "count_task", 2000, NULL, 2, &count_task_th);
  xTaskCreate(&mb_master_main, "mb_master_main", 3500, NULL, 9, &mb_master_main_th);
  xTaskCreate(&mb_slave_main, "mb_slave_main", 3500, NULL, 9, &mb_slave_main_th);
  xTaskCreate(&mqtt_ssl_main, "mqtt_ssl_main", 1024 * 8, NULL, 9, &mqtt_ssl_main_th);

I (109262) count_task: RAM left 126468, nvs_main_th: 392, server_task_th: 304, server_handle_task_th: 2304, count_task_th: 168, mb_master_main_th: 1512, mb_slave_main_th: 1388, mqtt_ssl_main_th: 6364

After (did increase nvs_main and server_task as well, just to make sure):

  xTaskCreate(&nvs_main, "nvs_main", 2500, NULL, 9, &nvs_main_th);
  xTaskCreate(&server_task, "server_task", 2500, NULL, 9, &server_task_th);
  xTaskCreate(&server_handle_task, "server_handle_task", 4000, NULL, 6, &server_handle_task_th);
  xTaskCreate(&count_task, "count_task", 2500, NULL, 2, &count_task_th);
  xTaskCreate(&mb_master_main, "mb_master_main", 3500, NULL, 9, &mb_master_main_th);
  xTaskCreate(&mb_slave_main, "mb_slave_main", 3500, NULL, 9, &mb_slave_main_th);
  xTaskCreate(&mqtt_ssl_main, "mqtt_ssl_main", 1024 * 8, NULL, 9, &mqtt_ssl_main_th);

I (271792) count_task: RAM left 125352, nvs_main_th: 888, server_task_th: 812, server_handle_task_th: 2392, count_task_th: 672, mb_master_main_th: 1428, mb_slave_main_th: 1480, mqtt_ssl_main_th: 6284

The counter_task is not very complex and its stack usage will proably not change very much:

static void count_task(void *pvParameters)
{
  const static char *TAG = "count_task";
  char out[100];
  int len;
  int clients;
  uint16_t n = 0;
  const int DELAY = 3000 / portTICK_PERIOD_MS; // 3 second

  ESP_LOGI(TAG, "starting task");
  for (;;)
  {
    len = sprintf(out, "Power: %.1f <br>Correction: %.1f <br>Fake Output: %.1f", PowerReal, PowerCorrection, PowerFake);
    clients = ws_server_send_text_all(out, len);

    ESP_LOGI(TAG,
      "RAM left %d, \
      nvs_main_th: %d, \
      server_task_th: %d, \
      server_handle_task_th: %d, \
      count_task_th: %d, \
      mb_master_main_th: %d, \
      mb_slave_main_th: %d, \
      mqtt_ssl_main_th: %d",
      esp_get_free_heap_size(),
      uxTaskGetStackHighWaterMark(nvs_main_th),
      uxTaskGetStackHighWaterMark(server_task_th),
      uxTaskGetStackHighWaterMark(server_handle_task_th),
      uxTaskGetStackHighWaterMark(count_task_th),
      uxTaskGetStackHighWaterMark(mb_master_main_th),
      uxTaskGetStackHighWaterMark(mb_slave_main_th),
      uxTaskGetStackHighWaterMark(mqtt_ssl_main_th)
      );

    if (clients > 0)
    {
      // ESP_LOGI(TAG,"sent: \"%s\" to %i clients",out,clients);
    }
    n++;
    vTaskDelay(DELAY);
  }
}

How can the stack size of the counter_task influence the mb_master task, especially when there were some bytes left according to the high watermark function?

One thing I note is that counter_task calls sprintf. sprintf is a complicated function, which can use a lot of stack, and possibly in ways that might not show up in stack high-water mark calls.

The problem can be that if the stack usage overflows, but that usage has a “hole” in it from an unused part of an array (which sprintf might have) that aligns with the end of allocated stack, then the high water mark might not see the overflow, as it sees “dry” memory there.

That’s exactly what I’ve observed time ago. I ended up stepping through the task code and watched the stack frame in a memory window of the debugger to tailor the stack size given to the task.
The only way to always detect stack overflows is having hardware support (stack limit registers) as introduced I think with Cortex-M33 and also supported by FreeRTOS.

So is this then a bug, that the uxTaskGetStackHighWaterMark() function does not show the real stack usage of a task?

No - it’s just a limitation of what’s possible to detect by software.