Zynq - TCP: Improve speed

hannes23 wrote on Wednesday, June 20, 2018:

Hello,

herewith I’d like to share my experience with improving TCP communication.
In my case I achieved a gain of more than 20 % in TCP speed on a 1000 Mbps link.

The following things are necessary:
1st: Re-map the OCM (on-chip memory) from the bottom to the top of the address space.
2nd: Force the linker to place the ucNetworkPackets buffers into the OCM space.

The remapping is done by a macro containing assembler code, called directly
after entering main().

My code is:

int main( void )
{
    xil_printf( "Hello from FreeRTOS main\r\n" );
    configASSERT( configUSE_TASK_FPU_SUPPORT == 2 );
    xil_printf( "configUSE_TASK_FPU_SUPPORT (FreeRTOS.h) is set to %d\r\n", configUSE_TASK_FPU_SUPPORT );

    /* Remap all 4 64 KB blocks of OCM to the top of memory and enable DDR address filtering. */
    MY_REMAP();

...
...

The configUSE_TASK_FPU_SUPPORT part can of course be omitted if it is not used.

The code for the MY_REMAP() define (found somewhere in the Xilinx forum) is:

#define MY_REMAP() asm volatile(                                              \
    "mov r5, #0x03                                                      \n"   \
    "mov r6, #0                                                         \n"   \
    "LDR r7, =0xF8000000    /* SLCR base address */                     \n"   \
    "LDR r8, =0xF8F00000    /* MPCORE base address */                   \n"   \
    "LDR r9, =0x0000767B    /* SLCR lock key */                         \n"   \
    "mov r10, #0x1F                                                     \n"   \
    "LDR r11, =0x0000DF0D   /* SLCR unlock key */                       \n"   \
    "dsb                                                                \n"   \
    "isb                    /* make sure it completes */                \n"   \
    "pli do_remap           /* preload the instruction cache */         \n"   \
    "pli do_remap+32                                                    \n"   \
    "pli do_remap+64                                                    \n"   \
    "pli do_remap+96                                                    \n"   \
    "pli do_remap+128                                                   \n"   \
    "pli do_remap+160                                                   \n"   \
    "pli do_remap+192                                                   \n"   \
    "isb                    /* make sure it completes */                \n"   \
    "b do_remap                                                         \n"   \
    ".align 5, 0xFF         /* force the next block onto a cache line boundary */ \n" \
    "do_remap:                                                          \n"   \
    "str r11, [r7, #0x8]    /* Unlock SLCR */                           \n"   \
    "str r10, [r7, #0x910]  /* Configure the OCM remap value */         \n"   \
    "str r9, [r7, #0x4]     /* Lock SLCR */                             \n"   \
    "str r6, [r8, #0x0]     /* Disable SCU & address filtering */       \n"   \
    "str r6, [r8, #0x40]    /* Set filter start address to 0x00000000 */ \n"  \
    "str r5, [r8, #0x0]     /* Enable SCU & address filtering */        \n"   \
    "dmb                                                                \n"   \
    );
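A note on the pli instructions and the .align directive: the remap changes the memory behind the code that is executing it, so the do_remap block is first preloaded into the instruction cache and aligned to a cache-line boundary, which makes sure the store sequence runs entirely from cache while the mapping is switched.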

The next step is to create a memory section ".ocm" by changing the linker description file.

The following changes need to be made.
In the MEMORY section add:
ps7_ocm : ORIGIN = 0xfffc0000, LENGTH = 0x3fe00

In the section description add:
.ocm (NOLOAD) : {
__ocm_start = .;
*(.ocm)
__ocm_end = .;
} > ps7_ocm
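
A side note: the __ocm_start and __ocm_end symbols defined above can also be referenced from C, for example to check at run time how much of the section is in use ( a minimal sketch ):

    /* The linker defines these symbols in the .ocm output section above. */
    extern uint8_t __ocm_start[], __ocm_end[];

    xil_printf( "OCM section spans %u bytes\r\n",
                ( unsigned ) ( __ocm_end - __ocm_start ) );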

The final step is to tell the compiler, via the buffer definition, that the buffers should be placed into the OCM.

In the file NetworkInterface.c add the ocm section attribute. The definition should then read:

static uint8_t ucNetworkPackets[ ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS * niBUFFER_1_PACKET_SIZE ]
    __attribute__( ( aligned( 32 ) ) )
    __attribute__( ( section( ".ocm" ) ) );

After compiling and linking, one can inspect the map file to see whether the .ocm section was successfully generated and populated.
It looks like:
.ocm 0x00000000fffc0000 0x30000
0x00000000fffc0000 __ocm_start = .
*(.ocm)
.ocm 0x00000000fffc0000 0x30000 ./src/Ethernet/FreeRTOS-Plus-TCP/portable/NetworkInterface/Zynq/NetworkInterface.o
0x00000000ffff0000 __ocm_end = .

All hints and changes are of course without any responsibility or warranty on my part.
If anybody has other or additional changes or hints to improve TCP communication speed, please let me know.
In particular, someone could perhaps comment on whether it makes sense to push other buffers or variables into the OCM as well; the sketch below shows what that would look like.
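
As an illustration, any statically allocated buffer can be moved there the same way; ucScratchBuffer below is just an invented placeholder name, only the attributes matter:

    /* Hypothetical example: the attributes are the same as those used
     * for ucNetworkPackets above. */
    static uint8_t ucScratchBuffer[ 1024 ]
        __attribute__( ( aligned( 32 ) ) )
        __attribute__( ( section( ".ocm" ) ) );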

Greetings to all.

rtel wrote on Wednesday, June 20, 2018:

Really appreciate you taking the time to write this up.

heinbali01 wrote on Wednesday, June 20, 2018:

Hi Johannes, thanks a lot for sharing this. I’m afraid I cannot comment on it, as I don’t know enough about the Zynq memory handling.
But I would be curious to see the results of a test with iperf3.
I’ll attach the latest version ( v3.0d ) of the iperf server to this message.

To activate the server, wait for +TCP to be ready and then call:

    void vIPerfInstall( void );
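
For example, the call could be made from the network event hook, so the server starts as soon as the network is up ( a minimal sketch; it assumes ipconfigUSE_NETWORK_EVENT_HOOK is set to 1 so that FreeRTOS+TCP calls this standard application hook ):

    void vApplicationIPNetworkEventHook( eIPCallbackEvent_t eNetworkEvent )
    {
        static BaseType_t xServerStarted = pdFALSE;

        if( ( eNetworkEvent == eNetworkUp ) && ( xServerStarted == pdFALSE ) )
        {
            xServerStarted = pdTRUE;
            vIPerfInstall();
        }
    }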

You can start a test on the host with this command:

    iperf3 -c 192.168.2.114 --port 5001 --bytes 100M [ -R ]

The reverse flag ( -R ) causes the Zynq to send data instead of receiving data.

heinbali01 wrote on Wednesday, June 20, 2018:

I wrote:

I would be curious to see the results of a test with iperf3.

It would be great if you could start two sessions simultaneously, in two directions:

    iperf3 -c 192.168.2.114 --port 5001 --bytes 1G
    iperf3 -c 192.168.2.114 --port 5001 --bytes 1G -R

We recently saw a problem with Zynq: incoming packets were dropped under heavy traffic, causing a very slow transfer speed for the first session ( the one without -R ).
I am curious to see whether faster memory access will prevent these problems.

hannes23 wrote on Thursday, June 21, 2018:

Hello Hein,

here are the results of the iperf3 tests I’ve done on my Zynq-7000, running at 666 MHz.

First I had to change the IP address and port number according to the needs of our firewall.
Then I got a stack fault. Maybe some other tasks which are running on my system caused this.
These other tasks don’t use much CPU time, I believe, so I didn’t change the software for the iperf tests.

After setting the stack size to 1000 everything was fine.

Next I changed the window and buffer settings to the values I used for my HTTP server work.
The code after the change is:

#ifndef ipconfigIPERF_TX_BUFSIZE
    #define mySETTINGS
    #ifdef mySETTINGS
        #define ipconfigIPERF_TX_BUFSIZE    ( 128 * 1024 )
        #define ipconfigIPERF_TX_WINSIZE    ( 48 )
        #define ipconfigIPERF_RX_BUFSIZE    ( ( 80 * 1024 ) - 1 )
        #define ipconfigIPERF_RX_WINSIZE    ( 24 )
    #else
        #define ipconfigIPERF_TX_BUFSIZE    ( 65 * 1024 )            /* Units of bytes. */
        #define ipconfigIPERF_TX_WINSIZE    ( 4 )                    /* Size in units of MSS */
        #define ipconfigIPERF_RX_BUFSIZE    ( ( 65 * 1024 ) - 1 )    /* Units of bytes. */
        #define ipconfigIPERF_RX_WINSIZE    ( 8 )                    /* Size in units of MSS */
    #endif
#endif

By the way: why should or must the RX_BUFSIZE be an odd number?

I’ve done tests with the original buffer sizes with and without -R, and also tests with my settings, again in both variants.

Finally I ran the concurrent test as you suggested and saw that the performance of the session without -R indeed dropped. On my debug UART I got lots of messages like:

    SACK[4503,34508]: optlen 12 sending 14583407 - 14584867

The test results are:
Original bufsize:

    [ 4]   0.00-19.59 sec  1.00 GBytes   438 Mbits/sec   sender
    [ 4]   0.00-19.59 sec  1024 MBytes   438 Mbits/sec   receiver

and for the reverse mode:

    [ 4]   0.00-27.12 sec  37.0 Bytes    10.9 bits/sec   4294967295   sender
    [ 4]   0.00-27.12 sec  1.00 GBytes   317 Mbits/sec   receiver

mySETTINGS bufsize:

    [ 4]   0.00-16.43 sec  1.00 GBytes   523 Mbits/sec   sender
    [ 4]   0.00-16.43 sec  1024 MBytes   523 Mbits/sec   receiver

and for the reverse mode:

    [ 4]   0.00-13.81 sec  37.0 Bytes    21.4 bits/sec   4294967295   sender
    [ 4]   0.00-13.81 sec  1.00 GBytes   622 Mbits/sec   receiver

For completeness I added the results in a file.

Greetings

heinbali01 wrote on Saturday, June 23, 2018:

Thanks Johannes, for these detailed and systematic measurements.
It looks like using your memory settings makes the Ethernet communication faster, by at least 20 %.
But unfortunately, it does not help against the packet loss in this case:

    My settings bufsize parallel:
    Connecting to host 169.254.79.19, port 4503
    [  4] local 169.254.214.213 port 35510 connected to 169.254.79.19 port 4503
    [ ID] Interval           Transfer     Bandwidth
    <snip>
    [  4]   3.00-4.00   sec   512 KBytes  4.19 Mbits/sec                  
    [  4]   4.00-5.00   sec   896 KBytes  7.34 Mbits/sec                  

In this case with heavy two-way traffic, incoming packets are being dropped.
Earlier I found that it helps to decrease the packet size of TCP packets ( MSS ).
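
In FreeRTOS+TCP the MSS is set in FreeRTOSIPConfig.h; a minimal sketch, the value 1160 being only an example:

    /* Example only: a segment size smaller than the Ethernet maximum of
     * 1460 bytes gives the EMAC smaller bursts to absorb. */
    #define ipconfigTCP_MSS    1160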

You wrote:

    and saw that indeed the performance of the one without -R really
    dropped. On my debug uart I got lots of messages like:
    SACK[4503,34508]: optlen 12 sending 14583407 - 14584867

That is indeed a sign of packets being dropped.

heinbali01 wrote on Saturday, March 02, 2019:

Hi Johannes, I finally solved the problem of the lost packets during concurrent transmissions.

See [this post](https://sourceforge.net/p/freertos/feature-requests/126/).

See x_emacpsif_dma.c, in the function emacps_send_message(): it is essential to read back the register that was just set:

    XEmacPs_WriteReg( ulBaseAddress, XEMACPS_NWCTRL_OFFSET, xxx );
+    /* Reading it back is important when compiler optimisation is enabled. */
+    XEmacPs_ReadReg( ulBaseAddress, XEMACPS_NWCTRL_OFFSET );
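
For context, a sketch of how this pattern can look around the bit that starts transmission, assuming the Xilinx BSP accessors and the XEMACPS_NWCTRL_STARTTX_MASK definition from xemacps_hw.h:

    uint32_t ulValue;

    /* Set the 'start transmission' bit in the network control register. */
    ulValue = XEmacPs_ReadReg( ulBaseAddress, XEMACPS_NWCTRL_OFFSET );
    XEmacPs_WriteReg( ulBaseAddress, XEMACPS_NWCTRL_OFFSET,
                      ulValue | XEMACPS_NWCTRL_STARTTX_MASK );

    /* Read the register back so the write has actually reached the
     * peripheral before the code continues. */
    ( void ) XEmacPs_ReadReg( ulBaseAddress, XEMACPS_NWCTRL_OFFSET );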

Now I started two concurrent sessions with iperf3, and both transported an equal amount of data.

This is the new Zynq driver.