Bug in the EMAC ISR

molnarzs wrote on Tuesday, April 18, 2006:

It may explain the occasional EMAC hangs. The problem, along with a sketch of a solution that worked in my case, is described in my blog:

During my current project (using an Atmel AT91SAM7X controller) I encountered a bug that also exists in the lwIP demo of FreeRTOS. I tried it and the problem really occurs. It is likely that the bug does not affect an embedded web server like the demo (under normal circumstances), but, as the problem is in the very simple ISR, it might be found in several different projects.

The observation: if we send a fast burst of a large amount of data for a couple of seconds (with a packet generator, for example), the EMAC controller stops answering and does not recover. If we do it for a minute, it is sure to hang.

The reason: data arrives faster than the controller is able to process it, so at some moment all the receive buffers are in use at the same time. The controller no longer generates RCOMP interrupts; because all the buffers are used (all the used bits are set), it generates only RXUBR interrupts. If this interrupt is not handled in the EMAC ISR, the networking task is never signalled, the data in the buffers is never processed, all the buffers remain used, and the controller stops receiving data.

The solution I used in my project: enable RXUBR as well. When this interrupt occurs, it is also treated as a frame having arrived. In fact, someone really wanted to send us something:

if ((status & AT91C_EMAC_RCOMP) || (status & AT91C_EMAC_RXUBR))
{
    /* A frame has been received, signal it */
    _n_ethernet_is_frame_received = TRUE;
}

This will trigger data processing. By checking the ownership bits (the code should do this anyway when it looks for the frame’s last buffer), the code can easily detect that all the buffers are used, so a data burst must have arrived. It then starts the appropriate recovery action (in my case, I simply release all the buffers, dropping their content, until the data rate normalizes).
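A minimal sketch of the recovery step described above, not the actual driver code. The descriptor type, the function names and the ownership bit position (bit 0 of the address word) are assumptions for illustration; a real driver would use the definitions from the vendor header and its own descriptor table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed: bit 0 of the descriptor address word is the "used" (ownership)
   bit, set by the EMAC once it has written a frame into the buffer. */
#define RX_OWNERSHIP_BIT 0x1u

/* Hypothetical receive descriptor: two 32-bit words per buffer. */
typedef struct {
    volatile uint32_t address; /* buffer address plus ownership/wrap bits */
    volatile uint32_t status;  /* frame length and start/end-of-frame flags */
} rx_descriptor_t;

/* True when every descriptor is marked used, so the EMAC has no free buffer. */
static bool rx_ring_is_full(const rx_descriptor_t *ring, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if ((ring[i].address & RX_OWNERSHIP_BIT) == 0u) {
            return false;
        }
    }
    return count > 0;
}

/* Drop the content of all buffers by handing them back to the EMAC.
   Returns the number of buffers that were released. */
static size_t rx_ring_release_all(rx_descriptor_t *ring, size_t count)
{
    size_t released = 0;
    for (size_t i = 0; i < count; i++) {
        if ((ring[i].address & RX_OWNERSHIP_BIT) != 0u) {
            ring[i].address &= ~RX_OWNERSHIP_BIT;
            released++;
        }
    }
    return released;
}

/* Recovery action: only when the whole ring is used, release everything.
   Returns true if a recovery was performed. */
static bool rx_ring_recover_if_full(rx_descriptor_t *ring, size_t count)
{
    if (!rx_ring_is_full(ring, count)) {
        return false;
    }
    (void)rx_ring_release_all(ring, count);
    return true;
}
```

The networking task would call something like rx_ring_recover_if_full() each time it is signalled, before searching for a frame. Dropping everything loses the queued frames, but higher-level protocols retransmit, and the controller can receive again.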


molnarzs wrote on Tuesday, April 18, 2006:

Someone wrote in the "tcp connection resetting" thread, relating to this bug, that:

>quote, " + Updated the SAM7X EMAC drivers to take
>into account the hardware errata 
>regarding lost packets.
>" from http://www.freertos.org/History.txt

>Could this be something to do with your problem?

Where can I find that HW errata? We have lost-packet problems as well, but they are not linked to the interrupt issue; the source of that problem became clear, and the fix described above solved it.

rtel wrote on Tuesday, April 18, 2006:

I was told directly by Atmel about the problem discovered in the EMAC hardware, and that this would be described in an errata "very shortly".  I therefore put the reference to the errata in the change history.  However, I have not actually seen the errata, and looking on the Atmel website I still cannot find it.  Maybe it is not written yet :wink:

The problem was described to me as:
+ An interrupt occurs, the AT91C_EMAC_RCOMP bit is set in the interrupt status register.
+ Under particular conditions, a timing glitch occurs in the device.  AT91C_EMAC_RCOMP is cleared again :frowning:
+ The interrupt is processed, but AT91C_EMAC_RCOMP is already clear so the interrupt source is not known from the interrupt status.

I got around this by also checking the AT91C_EMAC_REC bit in the RSR to find the interrupt source, as this bit (I’m told) should remain set even under the error condition.  If you are writing a more complete driver then the other bits in the RSR also need double checking even if the interrupt status bit is clear.
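The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the FreeRTOS port: the function name is hypothetical, and the two bit masks are defined locally with the values I believe the Atmel header uses, so in real code take them from the vendor header instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; use the definitions from the Atmel header in a
   real driver rather than these local copies. */
#define AT91C_EMAC_RCOMP (0x1u << 1) /* EMAC_ISR: receive complete */
#define AT91C_EMAC_REC   (0x1u << 1) /* EMAC_RSR: frame received */

/* Hypothetical helper: decide whether a frame has arrived from the values
   read out of the interrupt status register and the receive status
   register. RCOMP may have been cleared by the glitch, so REC is checked
   as a second source of truth. */
static bool emac_frame_received(uint32_t isr_value, uint32_t rsr_value)
{
    return ((isr_value & AT91C_EMAC_RCOMP) != 0u) ||
           ((rsr_value & AT91C_EMAC_REC) != 0u);
}
```

In an ISR, both registers would be read once into local variables and passed to this helper. My understanding is that the REC bit then has to be cleared by the driver (by writing it back to the RSR), otherwise it would stay set for the next interrupt; check this against the datasheet.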

This is of course difficult to test as generating a glitch is not something that can be done on demand.

Could it be that your original problem was that you were only checking the interrupt status register (as per the datasheet), not recognising that a frame had been received (as the AT91C_EMAC_RCOMP bit had been mistakenly cleared), and therefore your buffers were filling up?


molnarzs wrote on Tuesday, April 18, 2006:

Thanks for the info, that was really valuable!

In my case, that was it. I checked it, and both solutions (yours and mine) solve the problem, but yours is certainly much simpler and more to the point.

So the actual EMAC hangs (in version 4.0.1) must have a different source.

molnarzs wrote on Tuesday, April 18, 2006:

Hmm, I think the problem still exists, and the right solution would be to combine the two. We do not have to enable and check RXUBR, checking AT91C_EMAC_REC is OK, but my proposed recovery method must be there.

In ulEMACInputLength (SAM_EMAC.c), by the time you get to the section that looks for the frame length (line 318), all the buffers may get full. In this case, either you never get out of the loop (wrong or too long packets, etc.) or the function returns 0 - the packets are not processed. In both cases, the SW stops responding to the network requests.
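To illustrate the failure mode, here is a sketch of a bounded end-of-frame search that cannot loop forever and that tells the caller when the whole ring is used without a complete frame. It is not the code of ulEMACInputLength; the type, the names, and the bit positions (ownership in bit 0 of the address word, end-of-frame in bit 15 and length in the low 11 bits of the status word) are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed descriptor bit layout; take the real values from the vendor header. */
#define RX_OWNERSHIP_BIT 0x1u         /* address word: buffer used by EMAC */
#define RX_EOF_BIT       (0x1u << 15) /* status word: last buffer of a frame */
#define RX_LENGTH_MASK   0x7FFu       /* status word: frame length in bytes */

/* Hypothetical receive descriptor. */
typedef struct {
    volatile uint32_t address;
    volatile uint32_t status;
} rx_descriptor_t;

typedef enum {
    RX_FRAME_FOUND,      /* *length holds the frame length */
    RX_FRAME_INCOMPLETE, /* reached a buffer the EMAC has not filled yet */
    RX_RING_EXHAUSTED    /* every buffer is used, no end of frame seen */
} rx_search_result_t;

/* Search for the end of a frame starting at index `start`, visiting each
   descriptor at most once so the loop always terminates. */
static rx_search_result_t rx_find_frame_end(const rx_descriptor_t *ring,
                                            size_t count, size_t start,
                                            uint32_t *length)
{
    *length = 0u;
    if (count == 0u) {
        return RX_FRAME_INCOMPLETE;
    }
    size_t index = start % count;
    for (size_t visited = 0; visited < count; visited++) {
        if ((ring[index].address & RX_OWNERSHIP_BIT) == 0u) {
            return RX_FRAME_INCOMPLETE;
        }
        if ((ring[index].status & RX_EOF_BIT) != 0u) {
            *length = ring[index].status & RX_LENGTH_MASK;
            return RX_FRAME_FOUND;
        }
        index = (index + 1u) % count;
    }
    return RX_RING_EXHAUSTED;
}
```

On RX_RING_EXHAUSTED the caller would run the recovery action (release all buffers) instead of simply returning 0 and waiting for an interrupt that may never come.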

In my SW, I replaced the RXUBR check with yours while keeping my recovery code, and it works fine and survives the heavy load. But FreeRTOS, as is, hangs when I apply the same load.

nobody wrote on Tuesday, February 20, 2007:

Hello All,

I realize that this thread is a bit old, but I think that I’m running into a similar issue. This thread is the closest description to the problem that I can find.

I’m running FreeRTOS v4.2.0 lwIP_Demo_Rowley_ARM7 port for the AT91SAM7X256. Compiled with Rowley CrossStudio 1.6.

If I send a continuous stream of ping packets and hit the web server from 2 different machines at about the same time, the system locks up after a few seconds.

I’ve determined that the processor is DABORT’ing. I cannot seem to get much useful data out of the MC_AASR and MC_ASR registers. But it is DABORT’ing because: UNDADD = 1, MISADD = 1, ABTSZ = 2, all others are 0.

This thread mentions something about a bug in the EMAC ISR due to a hardware problem. I read the errata about the issue, but I’m not sure how to implement the changes in the ISR.

Has the EMAC ISR problem been fixed in the latest version of the port? Or am I running into a different issue with the ISR?

Thanks for any hints or advice!

nobody wrote on Tuesday, February 20, 2007:

Are you using GCC?  There have been some discussions recently.  Are you using -O0 optimisation?  If so try -O1.  Also try the -fomit-frame-pointer option.  Finally, try reducing the heap size, you might just be running out of stack space.  These are all things that have been discussed, so you may have tried them already.

nobody wrote on Tuesday, February 20, 2007:

Thanks for the tips!

I was using -O1, but not the -fomit-frame-pointer option. I tried it and it made no difference.

I’ve also tried reducing the heap size. I also tried increasing the default stack size for the “basic web server” to 1024. Neither change made a difference. The darn thing still locks up with a DABORT under load. :-\

Thanks again for the help.

I’m going to continue debugging it. Hopefully I’ll find the issue.

nobody wrote on Monday, May 07, 2007:

Has anyone found a solution? Could the original `nobody` of this post reply with his results?