xBufferSizeBytes++ in xStreamBufferGenericCreate leads to non-optimal code

Hi

the “quirk”, as the code comment calls it, of increasing xBufferSizeBytes by one leads to non-optimal code. The incremented xBufferSizeBytes is assigned to xLength, which is then greater by one than the requested size. If we now do, for example, lots of 32-bit or 64-bit (or similar 4- or 8-byte aligned) writes with prvWriteBytesToBuffer, the unfortunate extra byte destroys all alignment, assuming the requested size was a multiple of 4 or 8. The consequence is that any optimization typical memcpy implementations do for 4- or 8-byte aligned addresses is lost. This happens not only at the moment the buffer head wraps around, but also in all subsequent calls.

I do not understand why this increment has been done in the first place, but if there is a strong reason for it, wouldn’t it be better to at least keep xLength correct by supplying prvInitialiseNewStreamBuffer with the right number?
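
For context, the creation path in question looks roughly like this (paraphrased, not the verbatim FreeRTOS source; the names follow the real code but details may differ):

/* Inside xStreamBufferGenericCreate(): */
xBufferSizeBytes++;    /* the "quirk": reserve one extra byte */

/* Structure and storage area are allocated together... */
pucAllocatedMemory = ( uint8_t * ) pvPortMalloc( xBufferSizeBytes + sizeof( StreamBuffer_t ) );

/* ...and the already incremented size becomes the buffer's xLength. */
prvInitialiseNewStreamBuffer( ( StreamBuffer_t * ) pucAllocatedMemory,
                              pucAllocatedMemory + sizeof( StreamBuffer_t ),
                              xBufferSizeBytes,
                              xTriggerLevelBytes,
                              xIsMessageBuffer );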

Otherwise, great thanks for all this work!

Cheers
Rolf

Hi Rolf,

I don’t think adding a byte to the buffer changes its alignment. Would that alleviate your concern about memcpy optimizations?

A common characteristic of buffers like this (stream buffer, “ring” buffer, byte queue, etc.) is that their capacity is one byte less than the buffer allocated for them – just a result of a common implementation choice. So adding one to the user’s requested size allows the buffer to actually hold that many bytes of user data. You might say the code that adds one to the user’s requested size is avoiding a quirk.
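
For illustration, a minimal sketch of that classic convention (hypothetical code, not the FreeRTOS implementation), where one slot of the allocation always stays unused so that head == tail can unambiguously mean empty:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 64U

typedef struct
{
    uint8_t buf[ RING_CAPACITY + 1U ]; /* 65 bytes allocated to hold 64 bytes of user data */
    size_t head;                       /* next position to write */
    size_t tail;                       /* next position to read */
} RingBuffer_t;

static bool ringIsEmpty( const RingBuffer_t * pxRing )
{
    return pxRing->head == pxRing->tail;
}

static bool ringIsFull( const RingBuffer_t * pxRing )
{
    /* "Full" is head one slot behind tail; the byte between them is
     * never written, which is exactly the extra byte in question. */
    return ( ( pxRing->head + 1U ) % sizeof( pxRing->buf ) ) == pxRing->tail;
}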

Ok, let’s assume I have created a stream buffer with a requested length of 0x40, which internally got increased to 0x41. Then let’s assume xNextHead is at 0x40, resulting from a couple of 8-byte writes.

Now I am trying to write another 8 bytes. Eventually it reaches the code below:

xFirstLength = configMIN( pxStreamBuffer->xLength - xNextHead, xCount );

will evaluate to 1, from min( 0x41 - 0x40, 8 ). This leads to a memcpy of just 1 byte.

Since we have a leftover of 7 bytes, the second memcpy will read from &pucData[ 1 ], which is unaligned, to write into &( pxStreamBuffer->pucBuffer[ 0 ] ), which is aligned, but that doesn’t help much since the source isn’t.

As a result, xNextHead will be 7.

This leads to the situation that, in the next call of prvWriteBytesToBuffer, we will read 8 bytes aligned from pucData but will write unaligned to &( pxStreamBuffer->pucBuffer[ 7 ] ).
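
For reference, a standalone sketch (a hypothetical demo, not FreeRTOS code) that reproduces these numbers:

#include <stdio.h>
#include <stddef.h>

#define configMIN( a, b ) ( ( ( a ) < ( b ) ) ? ( a ) : ( b ) )

int main( void )
{
    size_t xLength = 0x41;    /* requested 0x40, incremented by one internally */
    size_t xNextHead = 0x40;  /* head position after eight aligned 8-byte writes */
    size_t xCount = 8;        /* the next 8-byte write */

    size_t xFirstLength = configMIN( xLength - xNextHead, xCount );
    printf( "first memcpy: %zu byte(s)\n", xFirstLength );   /* prints 1 */
    printf( "second memcpy: %zu byte(s) from offset %zu\n",
            xCount - xFirstLength, xFirstLength );           /* prints 7, from unaligned &pucData[ 1 ] */

    xNextHead += xCount;
    if( xNextHead >= xLength )
    {
        xNextHead -= xLength;
    }
    printf( "new head: %zu\n", xNextHead );                  /* prints 7, so later writes stay unaligned */

    return 0;
}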

I verified this scenario with a debugger. I came across it when using a buggy memcpy implementation (which isn’t the fault of FreeRTOS, of course).

But while doing this I noticed that this odd wrap-around really destroys all memcpy SIMD and similar optimizations.

Further, I still don’t understand why this extra byte is added. I have done circular buffers many times with exclusive read access from one context and write access from another. If you want, e.g., a 1024-byte ring buffer to feed 16-byte chunks at high speed, I have no idea what use byte 1025 really has.

Anyway, thanks for the reply!
Best
Rolf

======================

static size_t prvWriteBytesToBuffer( StreamBuffer_t * const pxStreamBuffer, const uint8_t *pucData, size_t xCount )
{
    size_t xNextHead, xFirstLength;

    configASSERT( xCount > ( size_t ) 0 );

    xNextHead = pxStreamBuffer->xHead;

    /* Calculate the number of bytes that can be added in the first write -
    which may be less than the total number of bytes that need to be added if
    the buffer will wrap back to the beginning. */
    xFirstLength = configMIN( pxStreamBuffer->xLength - xNextHead, xCount );

    /* Write as many bytes as can be written in the first write. */
    configASSERT( ( xNextHead + xFirstLength ) <= pxStreamBuffer->xLength );
    ( void ) memcpy( ( void * ) ( &( pxStreamBuffer->pucBuffer[ xNextHead ] ) ), ( const void * ) pucData, xFirstLength ); /*lint !e9087 memcpy() requires void *. */

    /* If the number of bytes written was less than the number that could be
    written in the first write... */
    if( xCount > xFirstLength )
    {
        /* ...then write the remaining bytes to the start of the buffer. */
        configASSERT( ( xCount - xFirstLength ) <= pxStreamBuffer->xLength );
        ( void ) memcpy( ( void * ) pxStreamBuffer->pucBuffer, ( const void * ) &( pucData[ xFirstLength ] ), xCount - xFirstLength ); /*lint !e9087 memcpy() requires void *. */
    }
    else
    {
        mtCOVERAGE_TEST_MARKER();
    }

    xNextHead += xCount;

    if( xNextHead >= pxStreamBuffer->xLength )
    {
        xNextHead -= pxStreamBuffer->xLength;
    }
    else
    {
        mtCOVERAGE_TEST_MARKER();
    }

    pxStreamBuffer->xHead = xNextHead;

    return xCount;
}

You’re right – if you are consistently inserting only native words into the buffer, and consistently extracting only native words, and if the streambuffer’s internal buffer is both word aligned and multi-word sized, and if the source and destination buffers in the application are word aligned, then you can get the benefit of memcpy optimizations. That’s a lot of "and if"s, but I see where you’re coming from now. Since a streambuffer hides its buffer, it’s best not to depend on its alignment (or size).

Out of curiosity, what platform are you using?

The unused byte in the underlying buffer is a common side effect of not maintaining a count field: when head == tail, the buffer is known to be empty. If the buffer were allowed to fill completely, head == tail could mean either empty or full, so you would need a count field (or similar) to tell the two states apart.
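
For illustration, a minimal sketch (hypothetical, not FreeRTOS code) of the count-field alternative, where head == tail is disambiguated and every allocated byte is usable:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint8_t buf[ 64 ];  /* all 64 bytes hold user data */
    size_t head;
    size_t tail;
    size_t count;       /* number of bytes currently stored */
} CountedRing_t;

static bool countedRingIsEmpty( const CountedRing_t * pxRing )
{
    return pxRing->count == 0U;                     /* head == tail and count == 0 */
}

static bool countedRingIsFull( const CountedRing_t * pxRing )
{
    return pxRing->count == sizeof( pxRing->buf );  /* head == tail and count == capacity */
}

The trade-off is that the count is written by both reader and writer, so it needs atomic updates (or a lock) when they run in different contexts, whereas the one-unused-byte scheme lets a single writer and a single reader each update only their own index.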

Hi

thanks for these explanations!

I am on Zynq for this.

In many application areas, writing multiples of 4- or 8-byte data to an equally aligned buffer is the default 90% of the time. For arbitrary data lengths I believe the message buffer would be a better approach anyhow.

I see the point regarding full vs. empty. Typically I would either have a flag for this or avoid the “full” situation entirely by preventing writes that would fill the buffer completely.

Since these are edge cases to be avoided in normal operation anyhow by adding some extra headroom, I believe introducing a flag or special treatment for them would be justified.

Many mallocs would round the extra byte up to 4 bytes anyhow, which could then be used explicitly as a count field instead.

But no problem, I can work around the current coding by just asking for 1023 bytes when I really need 1024.

Thanks
Rolf

Well, if you really NEED 1024 bytes, and want things on multiple-of-8 boundaries, ask for 1024 + 7 bytes, as an allocation of 1024 bytes can really only hold 1023 bytes, since 1 byte is always unused to distinguish between full and empty.
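
Following that advice, a usage sketch (the trigger level of 1 is just an example value):

#include "FreeRTOS.h"
#include "stream_buffer.h"

/* Request 1024 + 7 bytes so that, after the internal one-byte increment,
the underlying buffer is 1032 bytes, a multiple of 8, while still holding
at least 1024 bytes of user data. */
StreamBufferHandle_t xStream = xStreamBufferCreate( 1024 + 7, 1 );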