problem with lwIP demo on AT91SAM7XC-EK

rainer32 wrote on Thursday, April 20, 2006:

Hi all,

i just got a brand new AT91SAM7XC-EK and tried the lwIP demo. it compiled fine with arm-elf-gcc and i flashed it via jtag (with openocd). The web server demo works fine at first, but after a couple minutes at most, it’s no longer accessible (trying auto-refreshing web access only, with no other IP traffic to the board). when the web access blocks, the leds still keep blinking the same, so i guess freertos didn’t freeze, but the ip stack seems dead (doesn’t even take a ping any more, which had worked before trying the web page). the link/act led still blinks showing ethernet packets, but that’s all. It happens systematically, though not always at the same time / page hits count. with auto-refresh web access alone, i once got over 400 in the “page hits” count (which always gets incremented by 3 between refreshes); usually it stops before that. when i try to access from another machine simultaneously or with a simultaneous ping, it can go down to 25 “page hits” before the ip stack freezes (btw since the reset button doesn’t work in this demo, i have to take the usb power cable out between tries). But as i said: the 4 leds near the joystick go on blinking normally.
any clue what might be the source of my problem?


rainer32 wrote on Thursday, April 20, 2006:

from what i can read in previous threads, it might rather be in the emac code and not in the lwip stack (i might have misunderstood, cause there’s also another emac hang problem for which the solution seems to be found but still not integrated into the main source tree ?) and is also experienced by others, just like me, in the latest 4.0.1 release version. i couldn’t find any hint to a solution though. is there one already?


nobody wrote on Thursday, April 20, 2006:

Are you using it under heavy load?  There was a resolution just posted to this thread:

I have left the demo running with 3 PCs connected via http, and two pinging large amounts of data, and it worked overnight.  I have not tried with V4.01 yet.

Be aware that the command line GCC version seems to use more stack than the same code built with Rowley.  Rowley must do something to the libraries I think.

rainer32 wrote on Thursday, April 20, 2006:

thanks for the info, but i’m afraid it must be something else:
- for one thing, i’m using 4.0.1 and the ulEventStatus is already in function vEMACISR as described in the thread
- the mac already begins with 0 by default
- my at91sam7xc-ek board sits on the same (unmanaged) ethernet switch as my pc, but i don’t have a high packet count, just page reloads and sometimes a ping test.
- the other thing being about flags in line 617 of SAM7_EMAC.c, where the thread talks about removing AT91C_EMAC_CAF or even the whole call
note that in 4.0.1, there’s another flag set by default in that line: AT91C_EMAC_DRFCS.
Anyways, i tried just removing AT91C_EMAC_CAF and i tried commenting out the whole line, it doesn’t change a thing: the web server works for some time (ranging from tens of seconds to one or two minutes) then blocks, the leds still blinking on
So there must be some other cause to my problem
(unless i have overlooked something?)


nobody wrote on Friday, April 21, 2006:

Which GCC version are you using?  Does the served page show the task stacks?  Are any shrinking?

rainer32 wrote on Friday, April 21, 2006:

- I’m using arm-elf-gcc 3.4.3 (from the latest gnuarm for linux x86, which unfortunately is only version 3.4.3)
- The page indeed shows a column named “Stack”
- Indeed the sizes shown under that column do shrink for ALL processes, except for process “Check” which keeps a stack size of 83. examples (CAUTION: Format: Process Name, Stack Size at first page hit, stack size at Hit #689, stack size at last Hit before freeze 1764):
name    @0   @689 @1764
WEBSvr  170   69   69
lwIP    351  348  348
ETH_INT 300  296  296
IDLE     74   67   67
PolSEM1  81   68   68
they don’t all shrink at the same moment, but
they all shrink (except Check)

But there’s something else strange about this (see next post)

rainer32 wrote on Friday, April 21, 2006:

so here’s the strange thing i noticed while having a look at the stack size variations:
in order to get the stack sizes, i pulled the pages with a script running wget in a loop, resulting in a  much higher page load frequency than the auto-refresh with firefox. however, unlike the latter, this fast request bombing with wget didn’t bring the web server down. I had 4000 hits and going without any problem, whereas the firefox refresh alone sometimes brought the server down in less than 40 hits. So in order to show you the stack numbers i gave you, i also started a firefox refresh running in parallel to bring the server down. i’ll try and see if the stack sizes also go down with wget alone.

also, i noticed that the “page hit” was incremented by one between wget calls (without firefox), whereas it got incremented by 3 between firefox refreshes (without wget).
Thus, a possible hypothesis could be that wget probably sends very few http headers, whereas firefox sends tons of them (that i know), and that lwIP tries to buffer them all instead of ignoring those it doesn’t need, and/or starts sending before all headers of the http request are there and still gets headers sent from the client, which lwip would interpret as a new page hit, thus explaining the +3 increment between page refreshes in firefox (for linux x86).



rainer32 wrote on Friday, April 21, 2006:

ok so now i launched my wget loop alone and right now i’m at 27000 page hits and going on, with no freeze. however, the stack sizes do also go down, albeit much slower than in the case with firefox.
also the increment between wget page loads is one as expected (firefox: increment 3 between refreshes)


rainer32 wrote on Friday, April 21, 2006:

update on the wget loop:
starting at about 40000 or 50000 page hits (i deleted the files to gain disk space, so i don’t know it exactly) it gets painfully slow (and it’s not the pc’s fault) now i need 2 sec per page (instead of about 10-20 pages/sec at the beginning).
but the server is still running, and the leds blink at normal speed.

anyways, to bring the ip stack down quickly, use firefox on the client (preferably for linux) and wait for the page reloads to stop after a couple minutes.

nobody wrote on Friday, April 21, 2006:

Some older ARM GCC distributions had problems with interrupt entry/exit code - is this possible?

nobody wrote on Friday, April 21, 2006:

Firefox always wants to get a favicon.ico for the webpage. This increases the page hits because the example treats it like a web page request.

There is also no error checking in the example code. I found that writing to a netconn after a previous write has returned an error leads to unpredictable behaviour: the lists of TCP control blocks get messed up. To be precise, the last item in the list points with ->next to the first item instead of to NULL. If there is then a new search through the list with "while(pcb->next)", the program hangs, because the lwIP thread has the highest priority. So no blinking etc.
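The failure mode described here (a list whose last ->next points back to the first item, so that a "while(pcb->next)" walk never terminates) can be reproduced and detected on a toy list. This is only a sketch: struct node and list_has_cycle() are hypothetical names, not lwIP's actual control block structures or API, and the check uses Floyd's two-pointer technique rather than anything taken from the lwIP sources.

```c
#include <stddef.h>

/* Hypothetical stand-in for a TCP control block; only the link matters here. */
struct node {
    struct node *next;
};

/* Returns 1 if following ->next from head eventually loops back on itself,
 * 0 if the list ends in NULL. The slow pointer advances one step and the
 * fast pointer two steps per iteration; they can only meet inside a cycle. */
int list_has_cycle(const struct node *head)
{
    const struct node *slow = head;
    const struct node *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return 1;
        }
    }
    return 0;
}
```

A check like this could be called from a debug build before walking a list, so that a corrupted list is reported instead of hanging the highest-priority task.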

The crash of the stack happens if the connection is reset or aborted by the client between listen and close. Firefox will reset a connection if there is a timeout or an error. If this happens between listen and close, and you don’t handle the error returned by read or write by closing the connection, you will get into trouble.

I did an example with the socket interface, and I am checking every error now, and it runs and runs and runs with Firefox. I have 6 simultaneous windows open and I can press the refresh button as fast and as often as I like. (Pressing the reset button in Firefox leads to a reset of the connection.)

nobody wrote on Friday, April 21, 2006:

Excellent.  Are you able to give more details of the changes you made or send them in?

nobody wrote on Friday, April 21, 2006:

Just add some error checking, e.g.:

if(netconn_write() != ERR_OK){
    /* close the connection here and do not write to it again */
}
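To make the idea concrete, here is a minimal, self-contained sketch of "stop at the first failed write, then close exactly once". It does not call lwIP: demo_write(), demo_close(), struct demo_conn and the DEMO_* codes are hypothetical mocks standing in for netconn_write(), netconn_close(), the netconn itself and lwIP's error codes, so the control flow can be run on a PC.

```c
#include <stddef.h>

/* Stand-ins for lwIP's ERR_OK and a "connection reset" error (mock values). */
enum { DEMO_OK = 0, DEMO_ERR_RST = -1 };

/* Mock connection: fail_on_write == N makes the Nth and later writes fail
 * (0 means never fail). */
struct demo_conn {
    int fail_on_write;
    int writes;
    int closed;
};

/* Mock of a netconn-style write. */
static int demo_write(struct demo_conn *c, const void *data, size_t len)
{
    (void)data;
    (void)len;
    c->writes++;
    if (c->fail_on_write != 0 && c->writes >= c->fail_on_write) {
        return DEMO_ERR_RST;
    }
    return DEMO_OK;
}

/* Mock of a netconn-style close. */
static void demo_close(struct demo_conn *c)
{
    c->closed = 1;
}

/* Send header then body. After the first error nothing more is written,
 * and the connection is closed on every path. */
int serve_page(struct demo_conn *c,
               const char *hdr, size_t hdr_len,
               const char *body, size_t body_len)
{
    int err = demo_write(c, hdr, hdr_len);

    if (err == DEMO_OK) {
        err = demo_write(c, body, body_len);
    }
    demo_close(c);
    return err;
}
```

The point of the structure is that a failed write short-circuits every later write on the same connection, which is the situation the earlier post describes as corrupting the stack.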

There is also a "bug" in the lwIP code when you want to close a TCP connection. Sometimes Firefox says "No data in document" (maybe it’s a different text, I just translated the German notice). lwIP just kills the connection, even if there is still data in a buffer to send/resend. But this is another story. I added an additional semaphore and a check for an empty buffer before closing in lwIP. The solution for this is a quick and dirty hack, nothing professional. If you want to try it I can post it in the next few days.

nobody wrote on Friday, April 21, 2006:

I noticed that this is not everything for the netconn API: there the error checking should be expanded to look at which error happened. I used the socket API, and just checking for failure was enough.

rainer32 wrote on Saturday, April 22, 2006:

ok there are a couple things here:

- someone mentioned problems with older compilers, so i ditched the 3.4.3 from gnuarm and built a fresh new arm-elf-gcc in versions 4.0.3 and 4.1.0 (with newest binutils and newlib) -> the stack sizes still go down, the ip stack still freezes (leds keep blinking). actually, it might be a subjective feeling, but i have the impression that with gcc 4.1.0 it even freezes faster.

- about the compiler: now, independently of the ip stack freeze thing, did i understand right that the stack sizes should not go down? is there a possibility that the stack size thing is a compiler issue or is it somewhere in freertos/lwip/the web demo ? if it’s a compiler issue, is there any known thing about anything special that has to be done when configuring and building gcc so as to avoid these (and any other) problems ? what gcc version is recommended, what options?

- the favicon thing: ok, right, that would explain an additional increment, but does firefox do a second try at getting the favicon? cause the increment between page refreshes is 3, not 2

- i’m not sure if i understood correctly about your error checking patch stuff. could you specify in which source files i have to change what lines (and how) or a link to the patched files or whatever so i could try that and see if it makes a change over here with my gcc and at91sam7xc-ek?

- about that netconn and socket api (last post): i didn’t understand that at all. which api does the demo use and do i get it right that the socket api works without freeze after adding a fail check? if so, do you have a demo that is supposed to work without freeze? i’d like to try that.

- about the emac: is the emac isr and so cleared as a suspect, or could there still be some problem in there?

rtel wrote on Sunday, April 23, 2006:

I’ve just downloaded FireFox to give this a go and here are my preliminary findings:

+ I can intermittently duplicate part of the described problem.  On some executions the refresh works well with no degradation.  On a few occasions there has been a noted degradation after approximately 40 refreshes.  Taking a network sniffer log during the degraded performance shows that lwIP/SAM7 is taking a couple of seconds to send ACKs to the first GET request, causing some timeouts and retries.

+ I only ever get the refresh count incrementing by 1.

+ The stack sizes settle down with plenty left over (I’m using Rowley), then do not get any smaller.

+ The only time I have seen netconn_write return an error is when I closed down the browser mid transmission.

There is nothing obviously different in the network log between using FireFox and IE, so this is going to take some more investigation.


rainer32 wrote on Sunday, April 23, 2006:

You wrote:
> The stack sizes settle down with plenty left over (I’m using Rowley), then do not get any smaller.

i realize that what i wrote was easy to misunderstand. the limit of my decreasing stack sizes is not 0. indeed they “settle” to a value here too. however, i meant that:
- the initial value isn’t kept, they do go down from the initial value
- i can’t claim that the limit value is reached at any given time, or even say what it is. i have observed the case that one task still decreased its stack size (by 1) between 20000 and 40000 page hits (wget alone, no firefox, no ping), so at no time can i claim that they stopped decreasing. nevertheless, the ip stack freeze always occurs with stack size values of the same order of magnitude as (though lower than) the initial stack sizes.
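One way to read these numbers, assuming the "Stack" column is a high-water mark (the least free stack ever observed, measured by pre-filling the stack with a known byte and counting how much of the fill survives), is that the value can only stay the same or go down. The sketch below illustrates that technique on a plain array. FILL_BYTE, stack_init() and stack_free_bytes() are hypothetical names for illustration, not the kernel's own functions.

```c
#include <stddef.h>
#include <string.h>

#define FILL_BYTE 0xA5u  /* assumed fill pattern, chosen for illustration */

/* Paint the whole stack area with the fill pattern before the task runs. */
void stack_init(unsigned char *stack, size_t size)
{
    memset(stack, FILL_BYTE, size);
}

/* Count fill bytes still intact at the low end of the area. With a stack
 * that grows downwards, this is the space the task has never touched. */
size_t stack_free_bytes(const unsigned char *stack, size_t size)
{
    size_t n = 0;

    while (n < size && stack[n] == FILL_BYTE) {
        n++;
    }
    return n;
}
```

Because used bytes are never repainted, a single deep call chain (for example a rarely taken error path) lowers the reported value permanently, which would fit a value that still drops by 1 after tens of thousands of page hits.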

nobody wrote on Sunday, April 23, 2006:

I have been using V3.2.4 and have not seen the problem.  After upgrading to 4.0.1 the problem happens some of the time, not all of the time.  The EMAC service routine has a new test in it; could that be causing the problem?

rtel wrote on Thursday, May 04, 2006:

Having investigated the behaviour of the lwIP demo I have determined the following:

1) There does seem to be an issue in V4.0.0 and V4.0.1.  These versions had modified ISR code in accordance with the Atmel errata (which is now included in the datasheet btw).  Removing part of the change seems to fix most of the problems.

In the function vEMACISR() in SAM7_EMAC_ISR.c change the line:
ulEventStatus = AT91C_BASE_EMAC->EMAC_TSR;
to:
ulEventStatus = 0;

This effectively stops it having any effect.

2) As Tx interrupts apparently get called when the Tx buffer does not need clearing the function vClearEMACTxBuffer() needs to be made more robust.

In the function vClearEMACTxBuffer() within the file SAM7_EMAC.c move the lines:

    /* Start with the next buffer the next time a Tx interrupt is called. */
    uxNextBufferToClear++;

    /* Do we need to wrap back to the first buffer? */
    if( uxNextBufferToClear >= NB_TX_BUFFERS )
        uxNextBufferToClear = 0;

up into the if statement immediately above them.
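The effect of moving the increment and wrap inside the if can be seen in a small host-side model of the Tx descriptor ring. This is a sketch under stated assumptions, not the driver source: tx_status[], TX_OK_FLAG, TX_LAST_FLAG and clear_tx_buffers() are hypothetical stand-ins (with illustrative bit positions and ring size) for the real descriptor array, status bits and vClearEMACTxBuffer(). Only NB_TX_BUFFERS and uxNextBufferToClear are names taken from the post above.

```c
#include <stdint.h>

#define NB_TX_BUFFERS 4u           /* ring size, illustrative value */
#define TX_OK_FLAG    (1u << 31)   /* hypothetical "buffer transmitted" bit */
#define TX_LAST_FLAG  (1u << 15)   /* hypothetical "last buffer of frame" bit */

/* Mock descriptor status words, one per Tx buffer. */
static volatile uint32_t tx_status[NB_TX_BUFFERS];
static unsigned uxNextBufferToClear = 0;

/* Returns how many descriptors were released; 0 means the interrupt was
 * spurious and the ring position was left untouched. */
unsigned clear_tx_buffers(void)
{
    unsigned released = 0;

    if (tx_status[uxNextBufferToClear] & TX_OK_FLAG) {
        released = 1;

        /* Mark the remaining buffers of the frame as free again. */
        while (!(tx_status[uxNextBufferToClear] & TX_LAST_FLAG)) {
            uxNextBufferToClear++;
            if (uxNextBufferToClear >= NB_TX_BUFFERS) {
                uxNextBufferToClear = 0;
            }
            tx_status[uxNextBufferToClear] |= TX_OK_FLAG;
            released++;
        }

        /* Advance and wrap only when a frame really was cleared. */
        uxNextBufferToClear++;
        if (uxNextBufferToClear >= NB_TX_BUFFERS) {
            uxNextBufferToClear = 0;
        }
    }
    return released;
}
```

With the advance outside the if, a Tx interrupt that arrives when nothing needs clearing would still step the index forward, so later calls would start at the wrong descriptor.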

With these two changes the demo again seems to be very robust (to my tests anyhow).

It should also be noted that the lwIP buffers are not sized to be optimal for throughput.  Increasing the size of the Tx buffer (and reducing the number of buffers) will increase throughput considerably.  The sizes in the download are set for coverage testing of the driver source code.


nobody wrote on Thursday, June 01, 2006:

I’ve applied this patch, but I believe there’s another thing to fix… in emac.h:

must change
#define AT91C_LENGTH_FRAME 0x07FF
to
#define AT91C_LENGTH_FRAME 0x0FFF

The length counter is a 12-bit counter, not an 11-bit one. This isn’t a problem unless you have >1024 byte packets coming in, for which some pBufs don’t get their Ownership bit cleared.
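For reference, the arithmetic behind the two masks: 0x07FF keeps the low 11 bits (values up to 2047) and 0x0FFF keeps the low 12 bits (values up to 4095). The helper below is a hypothetical illustration, not code from emac.h. It assumes only that the frame length sits in the low bits of a descriptor status word.

```c
#include <stdint.h>

#define LENGTH_MASK_11BIT 0x07FFu  /* bits 0..10 */
#define LENGTH_MASK_12BIT 0x0FFFu  /* bits 0..11 */

/* Extract the length field from a status word using the given mask. */
uint32_t frame_length(uint32_t status, uint32_t mask)
{
    return status & mask;
}
```

The two masks differ only in bit 11, so they give different results only when that bit is set (values of 2048 and above). It may be worth checking the >1024 figure against the datasheet.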

my $0.02