Seeking Advice on ESP32 Software Architecture: ITC, Error Handling & Defensive Coding for a Self-Healing System

Hello everyone,

I’m working on an ESP32 project and would appreciate some insights on software architecture, particularly regarding inter-task communication (ITC) and robust error handling. I considered posting on the ESP-IDF forum, but I thought I might get broader architectural perspectives here, as my questions are more about general design patterns for a multi-tasking system.

I’ve started development, but the code isn’t quite in a shareable state yet, and I believe my questions are more conceptual at this stage, focusing on the overall software architecture rather than a specific bug.

Here’s an overview of the system I’m aiming for:

  • Goal: To build a critical system that is as self-healing as possible.
  • Components & Task Structure: The system involves several components, each running as its own FreeRTOS task:
    • Mesh Wi-Fi task
    • SoftAP with a basic HTTPS server task
    • MQTT client task
    • Buttons (input handling) task
    • Relays (output control) task
    • LCD display task
    • A dedicated error_handler task/module.
  • Component API: Each component task will expose public functions for its lifecycle management (init(), start(), stop(), update(), deinitialize()). Crucially, each component will also have its own specific public operational functions (e.g., mqtt_publish_message(), relay_set_state(), lcd_display_text(), etc.) that define its core functionality; a header sketch follows after this list.
  • Error Propagation: Each component is designed to report errors encountered within its operations to the calling function/module.
  • Error Handling and Recovery Mechanism: I’m utilizing an ERROR_CHECK macro (also sketched after this list). When a component’s function (especially lifecycle functions like init, start, stop, etc.) returns an error code, this macro captures it. The error code is then fed into a state machine (managed in what I’m calling main_callback.c and main_polling.c, which act as central supervisor/management logic). This state machine orchestrates pre-defined recovery scenarios based on the specific error received. These recovery actions can range from stopping/restarting the problematic component task to a full system reset.
  • Centralized Management: The lifecycle functions (init, start, stop, update, deinitialize) for each component task are primarily invoked from these central main_callback.c / main_polling.c contexts. These central modules are responsible for managing the state and lifecycle of the individual component tasks.
  • Resource Protection: Each component task will access its internal global data structures using mutexes, waiting with portMAX_DELAY when acquiring them.
  • Mutex Discipline: All functions and critical sections requiring a mutex are designed to strictly acquire and release it.
  • Inter-Component Data Exchange (Current thought): Component tasks that need to interact with each other would do so via public getter and setter functions (which themselves would handle necessary mutexing if accessing shared data internal to that component).
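To make the component API and the ERROR_CHECK idea concrete, here is roughly the shape I have in mind. This is only a sketch: supervisor_report_error(), component_id_t, and the function names are placeholders for my own supervisor logic, not existing ESP-IDF calls.

```c
#include "esp_err.h"

// Hypothetical component identifiers used by the supervisor state machine.
typedef enum { COMP_MESH, COMP_HTTPD, COMP_MQTT,
               COMP_BUTTONS, COMP_RELAYS, COMP_LCD } component_id_t;

// Feeds an error into the central state machine (placeholder prototype).
void supervisor_report_error(component_id_t id, esp_err_t err);

// Unlike ESP_ERROR_CHECK (which aborts), this forwards the error code to
// the supervisor so a pre-defined recovery scenario can be selected.
#define ERROR_CHECK(id, call)                                 \
    do {                                                      \
        esp_err_t err_rc_ = (call);                           \
        if (err_rc_ != ESP_OK) {                              \
            supervisor_report_error((id), err_rc_);           \
        }                                                     \
    } while (0)

// Per-component lifecycle, e.g. for the MQTT component:
esp_err_t mqtt_init(void);
esp_err_t mqtt_start(void);
esp_err_t mqtt_stop(void);
esp_err_t mqtt_update(void);
esp_err_t mqtt_deinitialize(void);

// Component-specific operational function.
esp_err_t mqtt_publish_message(const char *topic, const char *payload);
```

So main_polling.c would invoke, for example, ERROR_CHECK(COMP_MQTT, mqtt_start()); and the supervisor state machine decides whether to restart the task or reset the system.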

If you have any questions about the structure I’ve described so far, please feel free to ask, and I’ll do my best to clarify.

Now, for my main questions:

I’ve read the documentation and understand there are several methods for ITC in an ESP-IDF/FreeRTOS environment. I’m currently undecided between these three approaches for managing interaction and data flow between my component tasks:

  1. Polling with Mutexes & Getters/Setters: Tasks would periodically check shared data (protected by mutexes) exposed via getter/setter functions from other tasks.
  2. esp_event: Leveraging ESP-IDF’s system event loop library for a more event-driven approach between tasks.
  3. FreeRTOS Queues: Using standard FreeRTOS queues for message passing and data transfer directly between tasks (a minimal sketch follows this list).
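For reference, option 3 would look something like the following minimal sketch; relay_cmd_t, the queue depth, and the timeouts are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"

// Hypothetical message passed from the buttons task to the relays task.
typedef struct {
    uint8_t relay_index;
    bool    on;
} relay_cmd_t;

static QueueHandle_t s_relay_queue;      // created once at startup

void relay_queue_init(void)
{
    s_relay_queue = xQueueCreate(8, sizeof(relay_cmd_t));
}

// Producer side (e.g. buttons task): copy semantics, no shared state.
bool relay_request(uint8_t index, bool on)
{
    relay_cmd_t cmd = { .relay_index = index, .on = on };
    return xQueueSend(s_relay_queue, &cmd, pdMS_TO_TICKS(50)) == pdPASS;
}

// Consumer side (relays task main loop).
void relays_task(void *arg)
{
    relay_cmd_t cmd;
    for (;;) {
        if (xQueueReceive(s_relay_queue, &cmd, portMAX_DELAY) == pdTRUE) {
            // ... apply cmd to the hardware ...
        }
    }
}
```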

My Core Dilemma: Which of these methods (or perhaps a combination) do you believe would best provide concurrency safety (thread-safety), liveness (ensuring the system makes progress and doesn’t deadlock), and timeliness (meeting operational deadlines) for a system with multiple interacting tasks like mine?

I have a general understanding of the pros and cons of each. I suspect many might lean towards FreeRTOS Queues as a robust solution for direct task-to-task communication, but I’m slightly concerned it might be over-engineering for all interactions, or perhaps I’m underestimating the complexity where queues would be highly beneficial.

Seeking Your Experience:

  1. What ITC mechanisms do you typically use in your embedded projects (especially on ESP32/FreeRTOS with multiple tasks), and what are your main reasons for choosing them in different scenarios?
  2. How do you generally approach error handling and recovery in your embedded systems to achieve robustness or self-healing capabilities, especially when dealing with errors originating from different independent tasks?
  3. If you’re willing to share, even a high-level flowchart or a conceptual description of your typical system architecture for similar multi-tasking projects would be immensely helpful.
  4. Return Code Vigilance: How rigorously should I check the return codes of every ESP-IDF and FreeRTOS function? Are there situations where, based on documentation or common practice, certain successful returns can be more implicitly trusted, or should every call be wrapped in an error check?
  5. Data Integrity from APIs: When an ESP-IDF or FreeRTOS function successfully returns data (e.g., in a struct, via a pointer parameter), how much validation should I perform on that data? For instance, if xQueueReceive successfully returns an item, can I generally trust the contents of that item (assuming my sending code is correct), or are there common pitfalls or edge cases where the received data might still be problematic? (The sketch after this list shows the level of checking I mean.)
  6. Pointer Safety: If a library function (e.g., one that allocates a resource and returns a handle/pointer) indicates success, how safe is it to assume the returned pointer is valid and non-NULL? Should I still add explicit NULL checks as a best practice?
  7. General Philosophy: What’s your general philosophy or rule of thumb for balancing robust, defensive code against code readability/conciseness and potential (though often minor) performance overhead when interacting with these well-established libraries? Are there specific types of API calls you’re always extra cautious with?
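To make questions 4–6 concrete, this is the level of checking I’m unsure about; msg_t and RELAY_COUNT are made-up examples:

```c
#include <stdbool.h>
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"

#define RELAY_COUNT 4                       // hypothetical

typedef struct { uint8_t relay_index; bool on; } msg_t;

void receive_example(void)
{
    // xQueueCreate can return NULL on low heap -- check even this?
    QueueHandle_t q = xQueueCreate(4, sizeof(msg_t));
    if (q == NULL) {
        return;                             // or feed it to ERROR_CHECK
    }

    msg_t msg;
    if (xQueueReceive(q, &msg, pdMS_TO_TICKS(1000)) == pdTRUE) {
        // Should I still range-check fields my own sender produced?
        if (msg.relay_index >= RELAY_COUNT) {
            return;                         // defensive reject?
        }
        // ... act on msg ...
    }
}
```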

I understand that “it depends” is often the answer, but I’m keen to hear about your experiences, common practices, and any specific examples or guidelines you follow to strike a good balance.

Thanks in advance for your time and insights!

It’s been 12 hours and no response. Interesting. Are my questions not good enough?

First, as you say at the end, “it depends” will be a big part of the answer, and being a somewhat vague question, it won’t be high on people’s priorities to deal with. I will begin by saying I have not used ESP32 processors.

My first observation is that because of your dependency on a structure like main_callback/main_polling, it seems you are not thinking of the system in terms of a real-time system with interrelated operational centers, but as a single integrated monolith that has been divided into functional centers. I personally don’t start with a division into tasks, but with a division into operational centers, and then figure out if (and perhaps how many) “tasks” or other resources might be needed to implement those centers. For instance, “buttons” that can be directly read normally don’t require a “task”; they are better handled as interrupt-triggered or timer-sampled inputs that feed button events onto a queue to a system-state/user-interface task.
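A minimal sketch of that pattern in plain FreeRTOS (I haven’t used ESP32, so the interrupt registration and the exact yield macro are port-specific and omitted; button_event_t and the queue are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "freertos/task.h"

typedef struct {
    uint8_t    button_id;
    bool       pressed;
    TickType_t tick;            // timestamp for debounce/age decisions
} button_event_t;

static QueueHandle_t s_button_events;   // drained by the UI/state task

// Called from a GPIO interrupt or a timer-driven sampler.
void button_isr_handler(void *arg)
{
    BaseType_t woken = pdFALSE;
    button_event_t ev = {
        .button_id = (uint8_t)(uintptr_t)arg,
        .pressed   = true,
        .tick      = xTaskGetTickCountFromISR(),
    };
    xQueueSendFromISR(s_button_events, &ev, &woken);
    portYIELD_FROM_ISR(woken);  // wake the consumer if it is higher priority
}
```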

As far as “self-healing” goes, that seems best handled close to the source, where you know more details about the situation, and not sent back to some “central control” which has much more limited information and options (being limited by the ability of the API to report the condition and implement some sort of recovery). At that level the choices are mostly limited to limping on while ignoring the error, rebooting to clear it, or punting: report the problem to the user and let them choose.
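A sketch of what “handled close to the source” might look like; the sensor_hw_* calls are placeholders for whatever driver is actually involved:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_err.h"

esp_err_t sensor_hw_read(float *out);   // placeholder driver calls
void      sensor_hw_reset(void);

// Heals locally: retry with a local reset, and only report a plain
// failure upward once the local options are exhausted.
esp_err_t sensor_read(float *out)
{
    for (int attempt = 0; attempt < 3; ++attempt) {
        if (sensor_hw_read(out) == ESP_OK) {
            return ESP_OK;
        }
        sensor_hw_reset();              // local recovery action
        vTaskDelay(pdMS_TO_TICKS(10));
    }
    return ESP_FAIL;  // module is back in a stable state; the caller
                      // just learns "failed" -- it has nothing to heal
}
```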

Within a component, its INTERNAL state may often not need to be protected, unless the component has asynchronous parts. The external-facing getters/setters may need that protection, and thus the parts using that information need it too, but the external API getter/setter shouldn’t expose the full internal state. Needing that is a sign that you didn’t partition the system correctly.
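Something like this snapshot getter is the shape I mean: it copies a small external-facing struct under a mutex instead of handing out the whole internal state (names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

// Small external-facing snapshot, not the component's full state.
typedef struct { bool connected; uint32_t dropped_msgs; } mqtt_status_t;

static mqtt_status_t     s_status;        // touched only by this module
static SemaphoreHandle_t s_status_mutex;  // guards the snapshot copy

bool mqtt_get_status(mqtt_status_t *out)
{
    if (xSemaphoreTake(s_status_mutex, pdMS_TO_TICKS(100)) != pdTRUE) {
        return false;           // bounded wait; caller handles the timeout
    }
    *out = s_status;            // copy out, then release immediately
    xSemaphoreGive(s_status_mutex);
    return true;
}
```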

As far as interactions go, polling with getters/setters should not be used to determine IF something is to be done; once the decision to act has been made, it might be appropriate for implementing the action or determining WHAT is to be done. Modules should be relatively independent of each other, with well-defined communication channels.

One other comment: use of portMAX_DELAY is generally NOT advised, unless you literally mean “I am done here until somebody puts data on this queue, so I will wait till then.” Any wait that you expect to be finite should have a reasonable time limit, the operation should be checked, and a timeout should be handled appropriately. Infinite timeouts just hide deadlock as a stuck system and don’t let you know what is happening. Failing on a timeout at least lets you know when the problem happened.
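For example, a bounded take on a mutex; the 250 ms limit and handle_lock_timeout() are arbitrary and illustrative:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

static SemaphoreHandle_t s_lock;  // assume created via xSemaphoreCreateMutex()

void handle_lock_timeout(void);   // placeholder recovery hook

void do_guarded_work(void)
{
    // Bounded wait: a timeout becomes a visible, handleable event
    // rather than a silently stuck task.
    if (xSemaphoreTake(s_lock, pdMS_TO_TICKS(250)) != pdTRUE) {
        handle_lock_timeout();    // log it, count it, or escalate it
        return;
    }
    // ... critical section ...
    xSemaphoreGive(s_lock);
}
```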

Error returns: as I said, device failures are best handled locally in the device if there is something that can be “healed”. At worst the module cleans up to be in a stable condition and reports failure to its caller, but there is rarely any “healing” that the caller can do, unless the issue is some larger-scale situation that you really can’t control (someone forgot to plug something in).

I do check all the return codes, but most of that handling deals with possible timeouts, where assuming completion would simply be an error.

As to data validation: data sent over a trustworthy medium is as trustworthy as its sender. Operations tend to be programmed defensively so that I won’t do something stupid in processing it, but you need to define how much each part should trust the other parts.
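In other words, a cheap sanity check before acting, something like this (relay_cmd_t and RELAY_COUNT standing in for whatever your message really is):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RELAY_COUNT 4                 // illustrative

typedef struct { uint8_t relay_index; bool on; } relay_cmd_t;

// The queue itself is trustworthy; the check is about not doing
// something stupid if the sender ever has a bug.
static bool relay_cmd_valid(const relay_cmd_t *cmd)
{
    return cmd != NULL && cmd->relay_index < RELAY_COUNT;
}
```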
