UDP zero copy sendto outside of allocated stack?

huangwm wrote on Wednesday, September 18, 2019:

I’m inheriting a camera-based project that uses a Xilinx Zynq-7000 system. Video frame buffers are captured and placed in DDR memory by one of the Cortex cores. The second core runs FreeRTOS+TCP/UDP (v9.0.0), and both TCP and UDP are used. The UDP socket is mainly used to send video frame data (about 1.5MiB) using sendto() with standard calling semantics (zero copy flag not set), whereas the TCP socket is mainly used to exchange commands with external systems. The project also uses BufferAllocation_1.c for buffer management, and the NetworkInterface port is similar to the FreeRTOS example.

I would like to switch to zero copy calling semantics. My problem is that I cannot use the buffers pre-allocated by the TCP stack, because my video data is in a different memory location. Is it possible to somehow pass the address of where the video data is stored to the sendto() API without using the pre-allocated stack buffers?


huangwm wrote on Wednesday, September 18, 2019:

For additional info:
I tried the following:

bytes_sent = FreeRTOS_sendto(xServerSocket, videoBuffer, totalSize, FREERTOS_ZERO_COPY, &xSourceAddress, xSize);

However, a NULL network buffer is returned when the sendto() function converts the buffer to a network buffer using

NetworkBufferDescriptor_t *pxUDPPayloadBuffer_to_NetworkBuffer( void *pvBuffer )
    pxResult = * ( ( NetworkBufferDescriptor_t ** ) pucBuffer ); // <--- dereferenced to NULL here

There’s no alignment issue.

The other approach that I tried (and works, but not quite happy with it) is the following:

                    pNetworkBufferDescriptor = pxGetNetworkBufferWithDescriptor(totalSize, portMAX_DELAY);
                    if(pNetworkBufferDescriptor != NULL)
                    {
                        pNetworkBufferDescriptor->pucTempEthernetBuffer = pNetworkBufferDescriptor->pucEthernetBuffer;  // Store the previous Ethernet buffer address
                        pNetworkBufferDescriptor->pucEthernetBuffer = pEthernetBuffer;                                  // Point to new Ethernet buffer address

                        // Store network buffer descriptor ...
                        uint8_t *pNetworkBuffer = ( uint8_t * ) pEthernetBuffer; // Point to head of the Ethernet buffer
                        pNetworkBuffer -= ipBUFFER_PADDING;     // Move back ipBUFFER_PADDING for space to store the network buffer descriptor
                        * ( ( NetworkBufferDescriptor_t ** ) pNetworkBuffer ) = pNetworkBufferDescriptor;  // Store the pointer to the network buffer descriptor

                        bytesSent = FreeRTOS_sendto(xServerSocket, vptrVideoBuffer, totalSize, FREERTOS_ZERO_COPY | FREERTOS_ZERO_COPY_BUFSWAP, &xSourceAddress, xSize);
                    }

As you can see, I had to modify the FreeRTOS+TCP library a little bit to make this work.

rtel wrote on Wednesday, September 18, 2019:

Let me check I understand - you want to use zero copy but allocate the
buffers yourself? Is that correct? If so, who will ‘own’ the buffer -
more specifically, which software is responsible for freeing the buffer
afterwards? The buffer allocation is part of the port layer, so you can
provide your own which returns the address of the pre-allocated buffer
you want to use, but then you also need to be able to free it.
Alternatively can you use the stack to allocate the payload buffer and
use that as your camera buffer?

heinbali01 wrote on Thursday, September 19, 2019:

What you want is certainly possible, I think, as long as you use the UDP protocol. You will need to make some modifications to two network buffer functions.

I would recommend switching to BufferAllocation_2.c because it is more flexible regarding buffer allocation. You can have some buffers allocated with pvPortMalloc(), while other Network Buffers point to the shared video memory.

You will need a function that determines the origin of the memory, something like:

	BaseType_t xIsVideoMemory( uint32_t ulAddress )
	{
		return ( ulAddress >= VIDEO_START ) && ( ulAddress < VIDEO_START + VIDEO_LENGTH );
	}

Now when releasing a buffer, the driver knows exactly how to release the buffer. In case of a video buffer, it will be passed to the video processor / core.

	if( xIsVideoMemory( ( uint32_t ) pxDescriptor->pucEthernetBuffer ) )
		release_video_memory( pxDescriptor->pucEthernetBuffer - ipBUFFER_PADDING );
	else
		vPortFree( pxDescriptor->pucEthernetBuffer - ipBUFFER_PADDING );

As you have seen, each NetworkBufferDescriptor_t has a pointer to the actual Ethernet data, called pucEthernetBuffer.
This pointer always points to the first byte of the network packet, i.e. the Ethernet header:

EthernetHeader_t xEthernetHeader; /*  0 + 14 = 14 */
IPHeader_t xIPHeader;             /* 14 + 20 = 34 */
UDPHeader_t xUDPHeader;           /* 34 +  8 = 42 */
uint8_t ucVideoData[ 1472 ];

After these 42 bytes, the video data can be written.

xEthernetHeader starts at a 4-byte aligned address + 2 bytes. This extra 2-byte offset has to do with the strange length of an Ethernet header (14 bytes).
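
The offsets above can be checked with a small sketch. The struct names below are hypothetical stand-ins, not the FreeRTOS+TCP types; only their sizes matter. It assumes an IPv4 header without options and a standard 1500-byte MTU, which is where the 1472-byte payload comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real header types; only the sizes matter. */
typedef struct { uint8_t dst[6]; uint8_t src[6]; uint16_t type; } EthHdrSketch_t;
typedef struct { uint8_t bytes[20]; } IPv4HdrSketch_t;   /* IPv4, no options */
typedef struct { uint16_t sport, dport, length, checksum; } UDPHdrSketch_t;

typedef struct {
    EthHdrSketch_t  xEthernetHeader;     /*  0 + 14 = 14 */
    IPv4HdrSketch_t xIPHeader;           /* 14 + 20 = 34 */
    UDPHdrSketch_t  xUDPHeader;          /* 34 +  8 = 42 */
    uint8_t         ucVideoData[1472];
} UDPFrameSketch_t;

static_assert(sizeof(EthHdrSketch_t) == 14, "Ethernet header is 14 bytes");
static_assert(sizeof(UDPHdrSketch_t) == 8, "UDP header is 8 bytes");

#define SKETCH_MTU 1500u   /* assumed standard Ethernet MTU */

/* Offset of the first payload byte from the start of the frame. */
size_t payload_offset(void)
{
    return offsetof(UDPFrameSketch_t, ucVideoData);
}

/* Largest UDP payload that fits in one unfragmented IPv4 packet. */
size_t max_udp_payload(void)
{
    return SKETCH_MTU - sizeof(IPv4HdrSketch_t) - sizeof(UDPHdrSketch_t);
}

/* If the frame starts at (4-byte aligned address + 2), this is the
   position of the IP header modulo 4. Zero means it is word aligned. */
size_t ip_header_misalignment(void)
{
    return (2u + offsetof(UDPFrameSketch_t, xIPHeader)) % 4u;
}
```

With the 2-byte offset, the IP header (and also the payload at offset 42) ends up on a 4-byte boundary, which is the point of the odd-looking start address.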

I would make a new pxGetNetworkBufferForVideo(), which has an extra argument: the video buffer to be sent.

Remember that the buffer starts 10 bytes ( ipBUFFER_PADDING ) before the Ethernet header. In that space you will find a pointer back to the owning NetworkBufferDescriptor_t.
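
A rough sketch of that idea, written as a host-side simulation rather than as real FreeRTOS+TCP code. All names ending in `Sketch`, the trimmed-down descriptor, and the `ucVideoRegion` array are assumptions made for illustration. It also assumes that the 52 bytes (10 padding + 42 header) in front of each video payload are writable and not used by anything else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BUFFER_PADDING 10u  /* mirrors ipBUFFER_PADDING as described above */
#define SKETCH_HEADER_BYTES   42u  /* Ethernet + IPv4 + UDP headers */
#define SKETCH_HEADROOM (SKETCH_BUFFER_PADDING + SKETCH_HEADER_BYTES)

/* Hypothetical, trimmed-down stand-in for NetworkBufferDescriptor_t. */
typedef struct {
    uint8_t *pucEthernetBuffer;  /* first byte of the Ethernet header */
    size_t   xDataLength;        /* headers + payload */
} DescriptorSketch_t;

/* Stand-in for the shared video region (on target: from the linker script). */
uint8_t ucVideoRegion[4096];

int xIsVideoMemorySketch(const uint8_t *p)
{
    uintptr_t a = (uintptr_t)p, base = (uintptr_t)ucVideoRegion;
    return (a >= base) && (a < base + sizeof ucVideoRegion);
}

/* Point a descriptor at a payload that already sits in video memory and
   store the back-pointer in the padding in front of the Ethernet header. */
int xAttachVideoPayload(DescriptorSketch_t *pxDesc, uint8_t *pucPayload, size_t xLen)
{
    uintptr_t a = (uintptr_t)pucPayload, base = (uintptr_t)ucVideoRegion;

    /* Headroom and payload must both lie inside the video region. */
    if (a < base + SKETCH_HEADROOM || a + xLen > base + sizeof ucVideoRegion)
        return 0;

    pxDesc->pucEthernetBuffer = pucPayload - SKETCH_HEADER_BYTES;
    pxDesc->xDataLength = SKETCH_HEADER_BYTES + xLen;

    /* memcpy avoids any alignment assumption about the padding area. */
    memcpy(pxDesc->pucEthernetBuffer - SKETCH_BUFFER_PADDING, &pxDesc, sizeof pxDesc);
    return 1;
}

/* What a zero-copy send has to do: walk back from the payload pointer
   to the stored descriptor pointer. */
DescriptorSketch_t *pxPayloadToDescriptor(const uint8_t *pucPayload)
{
    DescriptorSketch_t *pxDesc;
    memcpy(&pxDesc, pucPayload - SKETCH_HEADROOM, sizeof pxDesc);
    return pxDesc;
}

typedef enum { RELEASE_TO_VIDEO, RELEASE_TO_HEAP } ReleaseTargetSketch_t;

/* Decide, on release, who gets the buffer back. */
ReleaseTargetSketch_t xReleaseTarget(const DescriptorSketch_t *pxDesc)
{
    return xIsVideoMemorySketch(pxDesc->pucEthernetBuffer) ? RELEASE_TO_VIDEO
                                                           : RELEASE_TO_HEAP;
}
```

This also shows why passing a bare video pointer with the zero copy flag returns NULL: nothing has written a descriptor pointer into the bytes in front of it.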

It all sounds complex, but you will get through it.

I am curious to hear how that works for you, and also about the gained efficiency.
If you want you can also send code as an attachment to your post.

heinbali01 wrote on Thursday, September 19, 2019:

One more thing to consider is the cache of the DDR memory. You can either switch it off for a region, or use cached memory. In that case, you will have to make sure that changes are flushed, and that the cache is invalidated before reading data.
Personally, I found it more convenient to switch off the caching for e.g. 1 MB of memory. Your video frames will be written once, and the EMAC peripheral will read them only once as well, so caching isn’t really profitable.

huangwm wrote on Friday, September 20, 2019:

Thanks Hein, using BufferAllocation_2 was a lot more flexible than BufferAllocation_1. With BufferAllocation_2, I didn’t have to modify a structure, which I feel better about than having to do it with BufferAllocation_1.

Performance wise, I didn’t see a difference between the two schemes. A quick and dirty test of transferring ~900kB took about 3-4 ms.

I’m still new to the embedded world and to the Zynq platform, so I will need to do some digging on how to turn off the caching for my region of memory. Currently, the code flushes the changes for one of the three frame buffers.

I’ve attached what I’ve added to BufferAllocation_2.c

heinbali01 wrote on Friday, September 20, 2019:

For disabling the cache, you can google for Xil_SetTlbAttributes().
In one module I wrote :

		Xil_SetTlbAttributes( ( uint32_t )pucMemory, 0x1c02 ); // addr, attr
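
As a hedged aside on what the attribute value 0x1c02 encodes: the sketch below decodes it according to the ARMv7-A short-descriptor "section" entry layout (1 MB granularity), assuming TEX remap is disabled. The struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an ARMv7-A short-descriptor section entry (1 MB per entry). */
typedef struct {
    unsigned is_section;  /* bits [1:0] == 0b10 */
    unsigned b;           /* bit 2: bufferable */
    unsigned c;           /* bit 3: cacheable */
    unsigned xn;          /* bit 4: execute never */
    unsigned domain;      /* bits [8:5] */
    unsigned ap;          /* bits [11:10]: AP[1:0] access permissions */
    unsigned tex;         /* bits [14:12]: memory type extension */
    unsigned shareable;   /* bit 16 */
} SectionAttrSketch_t;

SectionAttrSketch_t decode_section_attr(uint32_t attr)
{
    SectionAttrSketch_t f;
    f.is_section = ((attr & 0x3u) == 0x2u);
    f.b          = (attr >> 2) & 0x1u;
    f.c          = (attr >> 3) & 0x1u;
    f.xn         = (attr >> 4) & 0x1u;
    f.domain     = (attr >> 5) & 0xFu;
    f.ap         = (attr >> 10) & 0x3u;
    f.tex        = (attr >> 12) & 0x7u;
    f.shareable  = (attr >> 16) & 0x1u;
    return f;
}

/* TEX=0b001, C=0, B=0 selects Normal memory, inner and outer non-cacheable. */
int is_normal_noncacheable(uint32_t attr)
{
    SectionAttrSketch_t f = decode_section_attr(attr);
    return f.is_section && f.tex == 1u && f.c == 0u && f.b == 0u;
}
```

Read this way, 0x1c02 is a section entry with full read/write access (AP = 0b11) and non-cacheable Normal memory. Because the entry covers a whole 1 MB section, the region to be made uncached should be aligned and sized accordingly.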

“Performance wise, I didn’t see a difference between the two schemes”

I found the same result in other projects: on some platforms a memcpy() is so fast that you can hardly gain performance by using zero-copy methods.
But still it is worth investigating the possibilities, with and without caching.


Yes, very good!

One thing:

        // No need to free any memory here

Don’t you give back the unused video buffers?

heinbali01 wrote on Friday, September 20, 2019:

“A quick and dirty test of transferring ~900kB took about 3-4 ms.”

I’m still a bit puzzled about this result. Is that 900 K-bit or K-byte? A capital ‘B’ would suggest bytes, but that would mean that your Ethernet link transports about 2 Gbps.

huangwm wrote on Friday, September 20, 2019:

Sorry for the misunderstanding and for giving out the wrong info without double checking… I was a little excited about getting the implementation to work. The 3-4 ms that I reported was the time it took for the code to loop through the entire video frame data and pass it to the FreeRTOS_sendto() function. Of course, this was not a good representation of throughput over the Ethernet!
I turned on Wireshark and measured the time it took to capture one entire frame: approximately 9.5 ms for ~900 kilobytes of data.

Looking through the code, there doesn’t seem to be any management of the memory for the video frame buffers. From what I understand so far, the VDMA is constantly writing to DDR memory, with space reserved for three frame buffers as defined by a linker script. It does so in a rolling buffer fashion and does not care who else is operating in that region. The job of one of the cores on the Zynq is to read back the video data dumped by the VDMA and send it out over Ethernet. For zero-copy to work, I need a network buffer descriptor pointing to the correct memory location where the video data is stored.

Hopefully I’m not wrong, but it doesn’t seem like I would need to give back the video buffer because there doesn’t seem to be a concept of someone owning it. But! I may be wrong! After some long period of time, the application moves into a data abort exception handler (first time seeing this). Anyways, will need to debug further.

heinbali01 wrote on Monday, September 23, 2019:

“it took approximately 9.5 ms for ~900 kilobytes of data”

That sounds like a good performance! It means that you’re using about 75% of the total bandwidth. Mind you that the performance is likely to drop when there are other heavy users of the LAN.
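
The arithmetic behind both figures can be reproduced with a small helper. It assumes 1 kB = 1000 bytes and a 1 Gbit/s link (implied by the 75% figure, not stated in the thread), and it counts payload bytes only, ignoring header overhead.

```c
#include <assert.h>

/* Throughput in Mbit/s for `bytes` transferred in `millis` milliseconds. */
double throughput_mbps(double bytes, double millis)
{
    return (bytes * 8.0) / (millis / 1000.0) / 1.0e6;
}

/* Share of an assumed link speed (in Mbit/s), as a percentage. */
double link_utilisation_percent(double bytes, double millis, double link_mbps)
{
    return 100.0 * throughput_mbps(bytes, millis) / link_mbps;
}
```

For 900,000 bytes in 9.5 ms this gives roughly 758 Mbit/s, or about 76% of a gigabit link. The earlier 3-4 ms figure would correspond to about 2 Gbit/s, which is why it could not have been the time on the wire.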

I know nothing about digital video, but I know quite a bit about streaming audio.
In a streaming audio application, there will typically be 2 buffer pools: one pool of buffers that wait for an ADC. The ADC will fill these buffers, after which they’re sent to the Ethernet module. That also has a pool (FIFO) of buffers. These buffers are sent out to the LAN (by TCP, UDP, or multilink).
When a buffer has been worked on, it will be added to the next pool.
I’m attaching a picture that makes it clear.
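
In case the picture is not available, the two-pool scheme can be sketched as a pair of small FIFOs of buffer pointers. This is only an illustration with invented names; it is not ISR- or thread-safe, and on FreeRTOS a queue per pool would be the more natural building block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_CAPACITY 4u

/* A pool is a fixed-size FIFO of buffer pointers. */
typedef struct {
    uint8_t *slots[POOL_CAPACITY];
    size_t   head;    /* index of the oldest entry */
    size_t   count;   /* number of entries in use */
} BufferPoolSketch_t;

/* Append a buffer; returns 0 when the pool is full. */
int pool_put(BufferPoolSketch_t *p, uint8_t *buf)
{
    if (p->count == POOL_CAPACITY)
        return 0;
    p->slots[(p->head + p->count) % POOL_CAPACITY] = buf;
    p->count++;
    return 1;
}

/* Take the oldest buffer; returns NULL when the pool is empty. */
uint8_t *pool_get(BufferPoolSketch_t *p)
{
    uint8_t *buf;

    if (p->count == 0)
        return NULL;
    buf = p->slots[p->head];
    p->head = (p->head + 1u) % POOL_CAPACITY;
    p->count--;
    return buf;
}

/* One pipeline step: a buffer that has been worked on moves to the next
   pool, e.g. from "waiting for the ADC" to "waiting for the Ethernet". */
int pool_move(BufferPoolSketch_t *from, BufferPoolSketch_t *to)
{
    if (from->count == 0 || to->count == POOL_CAPACITY)
        return 0;
    return pool_put(to, pool_get(from));
}
```

After transmission, the same `pool_move()` call in the opposite direction hands the buffer back to the producer, so each buffer always has exactly one owner.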