Event Groups Concurrency Errors

tblackmon wrote on Sunday, March 30, 2014:

In xEventGroupSync (event_groups.c:182), I believe the following line has a concurrency issue:

		/* Rendezvous always clear the bits.  They will have been cleared
		already unless this is the only task in the rendezvous. */
		pxEventBits->uxEventBits &= uxBitsToWaitFor;

This line is within a vTaskSuspendAll/vTaskResumeAll block, but interrupts are still enabled.
Therefore it is possible that an interrupt could occur just before updating the uxEventBits value.
If this interrupt called xEventGroupClearBitsFromISR, then after it returns, the now stale data could overwrite the correct data.

xEventGroupWaitBits (event_groups.c:301) and xEventGroupSetBits(event_groups.c:486 and 549) also appear to have the same issue.

NOTE: I believe event_groups.c:182 also has a logical error:
pxEventBits->uxEventBits &= uxBitsToWaitFor;
should be:
pxEventBits->uxEventBits &= ~uxBitsToWaitFor;

tblackmon wrote on Sunday, March 30, 2014:

I think there could also be another much more subtle issue with the current xEventGroupSync implementation that could allow the final synchronizing task to not block while not waking the others.

Imagine a setup where xEventGroupSync was being used to synchronize 3 tasks. One of the tasks is fed by asynchronous I/O. If the I/O came in, the ISR might read the bits, then decide to clear one of them.

For example:

A set bit 0, then blocked waiting for bits [2:0]
B set bit 1, then blocked waiting for bits [2:0]
C calls xEventGroupSync to set bit 2, waiting for [2:0]
    C suspends the scheduler (line 169)
    C gets the current value of the event bits locally. (bits [1:0] are set)  (line 171)

Now an interrupt occurs clearing bit 0 with xEventGroupClearBitsFromISR.
NOTE: The interrupt will return to C since the scheduler is suspended

    C calls xEventGroupSetBits (event_groups.c:173).  This is expected to set
        the bits, make ready any blocking tasks that are waiting, and clear the
        bits if they are set to be cleared.  In this case, bit 2 will be set,
        but nothing will unblock or be cleared as only [2:1] are set now.
    C calculates uxReturn using that value read *before* the ISR hit.  This
        shows [2:0] set.
    C clears the bits and does not block.

In this situation, it would make sense if none of the tasks or all of them were to continue, but only one waking (actually, not blocking at all) is an issue.

rtel wrote on Monday, March 31, 2014:

I did reply to this post already this morning - but it seems none of the posts I made today actually resulted in the forum being updated. [EDIT: now the third attempt at posting - this time I was prepared though and copied my reply to the clipboard first!]

My reply to your first post went along the lines of:

Thanks for your efforts here. It looks like the ‘clear bits from isr’ function is doing something it aught not to do. We will consider what’s best to do about that, which may be to implement the function in the RTOS daemon task instead of directly in the ISR - so it works the same way as the ‘set bits from ISR’ function. Ref the logic error - we need to see how that has got through the testing as it should been caught in module tests.

The event groups feature is very new and I think these issues will need a maintenance release to be put out at the first opportunity.

Now onto your second post:

A couple of things here. First the above mentioned change to make the ‘clear bits from ISR’ a deferred function would actually rectify that situation anyway, however the scenario you present is not intended to be valid anyway. This is because event groups used for synchronisation are intended only ever to be cleared automatically when the sync is achieved, and not cleared manually unless a sync is aborted (due to a time out or whatever). That said, this fact is not stated in the documentation and so really you are pointing out a flaw in the documentation rather than the code. The ‘sync’ case being a special case rather than a generic case.

I do appreciate your input.


rtel wrote on Monday, March 31, 2014:

Ref the logic error - we need to see how that has got through the testing as it should been > caught in module tests.

I have looked at why this was not picked up by the tests, and the answer is that it could be argued the line is [almost] obsolete. The tests use the sync function to synchronise multiple tasks - and when this is done the call to xEventGroupSetBits() on line 173 (V8.0.0 code) has already cleared the bits before line 182 executes - therefore the tests were passing even though line 182 was not actually clearing the bits (they were already zero and line 182 just anded a value with 0 to leave them at zero). Therefore, as per the comments in the code, and assuming one event group is not used by more than one set of tasks to affect a sync, line 182 will only actually do anything useful when a task is synching with itself and no other tasks which would not make any sense (that is - it is setting a single bit then waiting for that single bit to be set - which of course it will always be because it just set it…).

None the less, it is conceivable that legitimate scenarios would exist in more dynamic systems, so the line has been corrected and a test added that actually exercises that special case.