Queue tactics for large memory objects

Hi there,
For a long time, whenever I had to use the Queue API to manage large objects (such as logs or other long “strings”), I used the FreeRTOS queue to store only the pointers or indexes into a global, compile-time-defined array. This gave me the impression that I was using the queue mechanism more efficiently, as I was only copying pointers instead of complete objects that can be (for example) up to 256 bytes. The disadvantage of this approach is that I had to manage the current global array index outside of the Queue API, i.e., I had to use critical sections when incrementing the index or resetting it to zero on roll-over.

I was wondering what approach other, more advanced FreeRTOS users take. For instance, if there is a requirement to:

  • have a Queue that can hold up to 16 items of 256 bytes each (the numbers only matter if we know the RAM/ROM constraints, but my intention is to show that the managed data is big),
  • there are many threads that can enqueue,
  • there is only one thread that can dequeue,
  • items can be put only to the back of the queue.

What would be your optimal approach and why?
I would appreciate all feedback.

I’d also use a by-reference queue mechanism. I’m sure there are alternative approaches for a custom allocator if you’re not satisfied with your current implementation.
E.g. using a 2^N ring buffer simplifies the wrap-around handling by just masking the head and tail indices.
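
For illustration, here is a minimal sketch of such a power-of-two ring buffer (the names and sizes are mine, not from the post); note that concurrent access would still need the critical sections discussed below:

#include <stdint.h>

#define RB_SIZE  16u                /* must be a power of two */
#define RB_MASK  ( RB_SIZE - 1u )

typedef struct
{
    uint8_t  buf[ RB_SIZE ][ 256 ];
    uint32_t head;                  /* next slot to write */
    uint32_t tail;                  /* next slot to read */
} RingBuffer_t;

/* The indices run freely and are masked only on access, so no
 * explicit roll-over branch is needed: */
uint8_t *prvNextWriteSlot( RingBuffer_t *rb )
{
    return rb->buf[ rb->head++ & RB_MASK ];
}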

Thanks for the reply. I normally use quite a simple ring buffer, in which I manually handle the enqueue part (incrementing and zeroing the index) but rely on the Queue API to know whether the queue is full or not.

However I do this with the by-reference approach, I don’t think I can get away from using critical sections for index management.

If all the buffers are exactly 256 bytes long, one solution is to make a queue to hold the “Free” pointers and a second queue to hold the queue of pointers to process. This may not be quite as efficient as your own handler for the objects, but is definitely simpler code wise.
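
For concreteness, a minimal sketch of that set-up (the names NUM_BUFFERS, xFreeQueue and xWorkQueue are illustrative, not from the post): both queues carry one buffer pointer per item, and every buffer starts out on the free queue.

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"

#define NUM_BUFFERS  16
#define BUF_SIZE     256

static uint8_t ucBuffers[ NUM_BUFFERS ][ BUF_SIZE ];
static QueueHandle_t xFreeQueue;   /* pointers to empty buffers  */
static QueueHandle_t xWorkQueue;   /* pointers to filled buffers */

void vInitBufferQueues( void )
{
    xFreeQueue = xQueueCreate( NUM_BUFFERS, sizeof( uint8_t * ) );
    xWorkQueue = xQueueCreate( NUM_BUFFERS, sizeof( uint8_t * ) );

    /* Initially every buffer is free. */
    for( int i = 0; i < NUM_BUFFERS; i++ )
    {
        uint8_t *pucBuf = ucBuffers[ i ];
        xQueueSendToBack( xFreeQueue, &pucBuf, 0 );
    }
}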

Since you have 16 slots in the queue, and a max of 256 bytes per slot, why not just use an array of arrays, where each slot is associated with a 256 char long buffer? Then all you need to queue up is the slot number that indexes into the array to point to the correct 256 char buffer. Guaranteed to work every time. If the buffers are allocated sequentially, your allocator becomes a next_slot index that gets incremented every time you enqueue and wraps around when it reaches 16. The wrap-around can be handled by ANDing the counter with 0x000F, thus:

// pseudo-code made concrete with the FreeRTOS primitives:
taskENTER_CRITICAL();
this_slot = next_slot++;
next_slot &= 0x000F;   // wrap around at 16
taskEXIT_CRITICAL();
xQueueSendToBack( xQueue, &this_slot, portMAX_DELAY );

The above code should only generate 2 machine instructions inside the critical section, which ought to be fast enough for anybody.

When you dequeue a slot, it becomes available again, so as long as you only have 16 slots in the queue, you have protected yourself against running out of buffers, as long as you check that the queue is not full before you allocate a buffer.

That actually isn’t safe with multiple senders that might be trying to send at the same time. Unless your queue is SMALLER than the buffer list by the number of tasks that might send, you can run out of buffers, as there will be a period between when a task gets a buffer and when it sends it.

Also, the buffers aren’t actually guaranteed to be used in order, as a task might get a buffer, get interrupted by another task that gets a buffer and sends it before going back to the first one.

This is where using a queue to hold the free buffers has advantages. It automatically handles the changing order and blocks tasks when the free buffer list gets exhausted (but you can still have a timeout to handle that problem).

Hi @richard-damon and @Moriah ,

Since you have 16 slots in the queue, and a max of 256 bytes per slot, why not just use an array of arrays, where each slot is associated with a 256 char long buffer?

This is exactly what I was doing so far. In the dequeue thread, I also first only peek the item and only dequeue it after processing, so it is not removed from the queue too fast. This way I don’t need to manually manage dequeueing indexes.
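
As a sketch, that peek-then-dequeue pattern looks something like this (single consumer only; xLogQueue and vProcessSlot are illustrative names):

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"

extern QueueHandle_t xLogQueue;            /* illustrative */
extern void vProcessSlot( uint8_t ucSlot ); /* illustrative */

void vConsumerTask( void *pvParameters )
{
    uint8_t ucSlot;

    for( ;; )
    {
        /* Look at the oldest item without removing it. */
        if( xQueuePeek( xLogQueue, &ucSlot, portMAX_DELAY ) == pdPASS )
        {
            vProcessSlot( ucSlot );        /* may take a while */

            /* Remove the item only now, so the slot stays
             * reserved in the queue while it is being used. */
            xQueueReceive( xLogQueue, &ucSlot, 0 );
        }
    }
}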

If all the buffers are exactly 256 bytes long, one solution is to make a queue to hold the “Free” pointers and a second queue to hold the queue of pointers to process. This may not be quite as efficient as your own handler for the objects, but is definitely simpler code wise.

This is an interesting approach I haven’t thought about. It will take a bit more heap memory, but I will definitely think it through.

That actually isn’t safe with multiple senders that might be trying to send at the same time. Unless your queue is SMALLER than the buffer list by the number of tasks that might send, you can run out of buffers, as there will be a period between when a task gets a buffer and when it sends it.

Also, the buffers aren’t actually guaranteed to be used in order, as a task might get a buffer, get interrupted by another task that gets a buffer and sends it before going back to the first one.

This is where using a queue to hold the free buffers has advantages. It automatically handles the changing order and blocks tasks when the free buffer list gets exhausted (but you can still have a timeout to handle that problem).

My go-to tactic here is for the enqueueing thread to block in case the queue is full. Checking whether the queue is full is also done in a critical section. In case the queue is full, the critical section is left and the thread is blocked for 1 ms; this repeats in a loop until there is space. Since a critical section is used, there should be no conflicts with other enqueueing threads. I also always try to make the number of items in the queue large enough that the code rarely reaches this part.
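
A minimal sketch of the loop being described, with illustrative names (the replies below explain the pitfalls of this pattern):

for( ;; )
{
    taskENTER_CRITICAL();

    if( uxQueueSpacesAvailable( xLogQueue ) > 0 )
    {
        /* ... reserve a slot / advance the index here ... */
        taskEXIT_CRITICAL();
        break;
    }

    taskEXIT_CRITICAL();
    vTaskDelay( pdMS_TO_TICKS( 1 ) );   /* queue full: back off 1 ms */
}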

The problem is that to make that work, you need the critical section to be very long: it has to start before you test for the queue being full and extend until you add the new block to the queue, and during that time you need to actually fill the buffer. If you end the critical section before that, the synchronization between the buffers in use and the fullness of the queue can be broken.

Also, the operation of putting an item on the queue isn’t really supposed to be done inside a critical section, you might have a problem there (would need to double check how the asserts and conditions are actually stated).

You also end up with nothing to block on if the queue is full to wait for some room to be available.

The dual queue is the simplest and safest approach here, and it allows the enqueueing task to block while waiting for a buffer. What I don’t like about it is the extra overhead of two queues, but it seems to solve all the problems, even the situation where the task that dequeues a buffer from the main queue holds onto it for a while before giving it back to the free queue. Perhaps you could have extra buffers in the free queue in case a task needs to hold several dequeued buffers before releasing them. The hard part that it solves is that the enqueueing task first acquires the buffer, then fills it (which might take a while, especially if it is being filled from a communications link of some sort), then enqueues it; the dequeuing task then takes the buffer from the queue, owns it for a while as it processes the contents, and finally frees it by putting it on the free queue.

Without having thought this all the way through, I believe you may want to look into reader-writer locks as a 1:many synchronization option. We discussed this before, e.g. here:

Multiple task synchronization problem - Kernel - FreeRTOS Community Forums

Thanks for the further feedback @Moriah and @RAc ,
I fully agree with you about the long critical section: this sort of starts to kill the whole concept, and I am starting to wonder whether copying with memcpy would not be faster in the end.

The nature of this design is that I want the enqueue part to be as fast as possible, while I don’t care that much how long the dequeue part takes.

In the end I think that the dual queue (one for free indexes and one for taken indexes) could be the easiest to implement. I understand that in this case the multiple threads would dequeue from the free-indexes queue and enqueue to the taken-indexes queue. The single receiver thread would dequeue and enqueue in the reverse fashion - is that right?

Yes, to send a message, a task takes a buffer from the free queue, fills it with its message, then sends it. It can take the buffer early and build the message directly in it, so it doesn’t need to copy.

Then the receive gets a message from the work queue, processes it, and when done puts the empty buffer back on the free queue.

The whole thing needs NO critical sections in user code, just the ones already in the Queue code.
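
Sketched in code, continuing the set-up from earlier (vSendLog and vLoggerTask are illustrative names):

#include <string.h>

/* Any sender task: */
void vSendLog( const char *pcMsg )
{
    uint8_t *pucBuf;

    /* Block until a free buffer is available; a finite timeout
     * could be used here instead of portMAX_DELAY. */
    if( xQueueReceive( xFreeQueue, &pucBuf, portMAX_DELAY ) == pdPASS )
    {
        /* Build the message directly in the buffer - no extra copy. */
        strncpy( ( char * ) pucBuf, pcMsg, BUF_SIZE - 1 );
        pucBuf[ BUF_SIZE - 1 ] = '\0';

        xQueueSendToBack( xWorkQueue, &pucBuf, portMAX_DELAY );
    }
}

/* The single receiver task: */
void vLoggerTask( void *pvParameters )
{
    uint8_t *pucBuf;

    for( ;; )
    {
        xQueueReceive( xWorkQueue, &pucBuf, portMAX_DELAY );
        /* ... process / print the buffer ... */
        xQueueSendToBack( xFreeQueue, &pucBuf, 0 );
    }
}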

Of course Richard is correct - mutual exclusion with regard to buffer indices is already taken care of by the queue implementation.

By prioritizing your enqueuers higher than the dequeuer, you can take care of your prioritization, but of course you need to be prepared for a failing shock absorber. I believe that has been addressed before. It will all work, but most likely you will encounter unexpected timing behavior, such as your enqueuers filling your queue until it is full (all of them blocking), then your dequeuer removing exactly one element until the next enqueuer gets woken, and so on (a saw-tooth pattern).

@richard-damon this solution seems spot on indeed.

It can take the buffer early and build the message directly in it, so it doesn’t need to copy.

Exactly…

Will test it out, thanks.

@RAc Thanks for the input. I will need to digest it though.

Sorry if I was too concise - I meant to restate basic concurrency: if you have data coming in faster than you can process it, you do not benefit from a queue, because your system will oscillate between full and full-1. A queue is a shock absorber in the sense that if, ON THE AVERAGE, you can keep up with the data-in/data-out ratio, the queue will absorb peaks, but it cannot make a funnel wider.

I hope that makes more sense?

Unless you have an overriding reason to do otherwise, I would run all the queue accessing tasks here at the same priority and use round robin time slicing scheduling.
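
For reference, round-robin time slicing among equal-priority tasks is governed by two FreeRTOSConfig.h settings, shown here with the values that enable it:

#define configUSE_PREEMPTION    1
#define configUSE_TIME_SLICING  1   /* rotate equal-priority ready tasks each tick */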

Thanks for the additional feedback @RAc and @Moriah ,
My case is a simple singleton logger module. Many threads can enqueue, only one thread dequeues and prints the logs through a peripheral.

A queue is a shock absorber in the sense that if, ON THE AVERAGE, you can keep up with the data-in/data-out ratio, the queue will absorb peaks, but it cannot make a funnel wider.

There will be periods with more and less frequent log enqueuing; that’s why I have always used the queue approach for such designs. If I find that the queue is full too often, I can make it larger. I normally don’t overwrite the oldest queue items when the queue is full, as I would lose the logs.

Originally when posting this, I was just trying to learn the most efficient way to design such a queue, one that yields a good balance between speed and memory usage. The dual-queue approach for free and taken indexes sounds good to me.

I find that it rarely makes sense to make all tasks the same priority, especially if the reasoning is just that they all post to the same queue. Priority is your tool for allocating CPU resources to try to achieve on-time results.

Other than inside time critical device handlers and their associated task level code, this is only a problem when you have already stressed the capabilities of your system. Good system design determines how much time each process requires to do a meaningful increment of work, and if you haven’t allocated enough time overall to meet that, then your system is underdesigned.

A priority scheme only really serves to get important things done first, because if they don’t get done quickly enough, the system fails.

Even with priorities, the idle task should still have plenty of time to run every system time slice, whatever that time is for your system. The queue backup problem described earlier is a symptom that there just isn’t enough time to get everything done.