Int32 overflow for Rx or Tx packet count?

Hi,

I’ve been running extensive system tests on an MCU which uses a FreeRTOS+TCP V2.3.3 port for an STM32H7. I sent ~5000 UDP packets/s, each containing a device command (just to create traffic and stress the firmware). It looks like this:
PC —COMMAND—> MCU ; MCU —RESPONSE—> PC.

I have now tried this with multiple SW iterations of our own firmware and always ended up in the same situation: after a few days or a week of working perfectly fine, our device locked up. It seemed like the ETH_IRQHandler from here was called permanently, but HAL_ETH_IRQHandler could not determine the source of the IRQ (no handler function was called within HAL_ETH_IRQHandler; execution just stepped over every condition in the function).

When I manually stepped through the disassembly, this seemed to block the IRQs, and I could actually get out of the ISR and into some normal app code. I expected to be stuck in a forever loop (while(1){} etc.), but this was not the case, and stepping was possible until I resumed program execution in the debugger again.

My testing script logs some meta information out after every 10000 packets sent and the last log before timing out always looked similar to this:

bit/s:3063256 	 time: 2021-09-20 19:20:41.210802
current packet drop: 12313 / 2146630000 ( 0.0005735967539818227 %)

The number of actively sent packets is 2.146.630.000 here, until at some point the device locked up. Accounting for the occasional ARP/LLMNR/… traffic that the managed switch in the test network generates, the above number is awfully close to an int32 overflow (2^31 or 2.147.483.648).
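To put numbers on "awfully close", here is a minimal sketch of the arithmetic in C. The helper names are made up for illustration and are not part of FreeRTOS+TCP or the ST HAL; the only inputs are the figures quoted above (2146630000 packets logged, roughly 5000 packets/s).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many more packets until a 32-bit counter
 * reaches 0x80000000, i.e. until bit 31 becomes set. */
static uint32_t packets_until_bit31(uint32_t count)
{
    assert(count < UINT32_C(0x80000000));
    return UINT32_C(0x80000000) - count;
}

/* Hypothetical helper: whole seconds needed to count from zero up to
 * 0x80000000 at a steady packet rate. */
static uint32_t seconds_to_bit31(uint32_t packets_per_second)
{
    assert(packets_per_second > 0);
    return UINT32_C(0x80000000) / packets_per_second;
}
```

With the logged value, packets_until_bit31(2146630000) gives 853648, and seconds_to_bit31(5000) gives 429496 seconds, which is a little under 5 days. That would fit the "few days or a week" pattern if a counter on the device really does reach 2^31.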

I also diff’ed the ETH peripheral registers but (with my limited knowledge) did not find any noticeable issues.

I know this is a very vague description and the error could basically be anywhere. But I’m pretty desperate and don’t know where to start searching.

Could there be any limitations regarding “packet enumeration” in the FreeRTOS+TCP stack?

Thank you in advance for any advice or help.

EDIT:
I did some further digging and found the abstraction of the ETH peripheral registers, ETH_TypeDef, in CMSIS/Device/ST/STM32H7xx/Include/stm32h7xx.h. A register called ETH_TX_PACKET_COUNT_GOOD is implemented there as __IO uint32_t MMCTPCGR; (the naming differs from the reference manual). Reading out its value with a debugger revealed that it’s exactly 0x80000000.

From your description (although based just on that this hypothesis could be wrong) it sounds like the MCU is continuously entering an interrupt, failing to clear the interrupt, exiting the interrupt, then re-entering because the interrupt is still asserted. Stepping through in the debugger may prevent the interrupt from being entered while the MCU is under the control of the debugger.
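As a toy model of that hypothesis (plain C, not real driver code), the sketch below treats the interrupt line as asserted whenever any flag is pending, and uses a handler that only clears the flags it knows about. All flag and function names are hypothetical and do not correspond to actual STM32 register bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags, for illustration only. */
#define FLAG_RX_DONE     (UINT32_C(1) << 0)
#define FLAG_TX_DONE     (UINT32_C(1) << 1)
#define FLAG_COUNTER_IRQ (UINT32_C(1) << 2) /* a source the handler ignores */

/* A handler that services and clears only the sources it knows about. */
static uint32_t handler_run(uint32_t pending)
{
    return pending & ~(FLAG_RX_DONE | FLAG_TX_DONE);
}

/* Count handler entries until no flag is pending, giving up after
 * max_entries so that the model itself always terminates. */
static unsigned entries_until_idle(uint32_t pending, unsigned max_entries)
{
    unsigned entries = 0;

    assert(max_entries > 0);
    while (pending != 0 && entries < max_entries) {
        pending = handler_run(pending);
        entries++;
    }
    return entries;
}
```

In this model, entries_until_idle(FLAG_RX_DONE | FLAG_TX_DONE, 1000) returns 1, while entries_until_idle(FLAG_COUNTER_IRQ, 1000) runs into the limit of 1000: the handler is entered again and again because nothing ever clears the remaining flag.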

The function you link to just calls HAL_ETH_IRQHandler( &( xEthHandle ) ); What does that function do? Does it handle all the error interrupts too - or just the happy path? What happens if you add code to manually clear all pending interrupts?


Hello!

I agree with what Richard has said. You should try clearing all the interrupts at the end of the IRQ handler. Try adding the below line here:

__HAL_ETH_DMA_CLEAR_IT( heth, 0xFFC7 );  /* Other bits are reserved as per the document below. */
/*  https://www.st.com/resource/en/reference_manual/dm00314099-stm32h742-stm32h743-753-and-stm32h750-value-line-advanced-arm-based-32-bit-mcus-stmicroelectronics.pdf */

While this is not a proper solution, it will help us debug your issue.

You said:

What do you mean by this - Does this mean that as soon as you resume execution in the debugger, the code gets stuck in the interrupt again, or does it run fine once you stop and resume the debugger?

Thanks for your responses.

@rtel HAL_ETH_IRQHandler is a driver distributed within my FreeRTOS+TCP port (portable/NetworkInterface/STM32Hxx/stm32hxx_hal_eth.c).
Seems like at least some error handling was done:

            /* ETH interrupt. See heth->DMACSR for details.
             */

            if( __HAL_ETH_DMA_GET_IT( heth, ETH_DMACSR_RI ) )
            {
               ...
            }

            /* Packet transmitted */
            if( __HAL_ETH_DMA_GET_IT( heth, ETH_DMACSR_TI ) )
            {
               ...
            }

            /* ETH DMA Error */
            if( __HAL_ETH_DMA_GET_IT( heth, ETH_DMACSR_AIS ) )
            {
               ...
            }

            /* ETH MAC Error IT */
            if( __HAL_ETH_MAC_GET_IT( heth, ( ETH_MACIER_RXSTSIE | ETH_MACIER_TXSTSIE ) ) )
            {
                ...
            }

            /* ETH PMT IT */
            if( __HAL_ETH_MAC_GET_IT( heth, ETH_MAC_PMT_IT ) )
            {
                ...
            }

            /* ETH EEE IT */
            if( __HAL_ETH_MAC_GET_IT( heth, ETH_MAC_LPI_IT ) )
            {
               ...
            }

            ...

            /* Check ETH WAKEUP EXTI flag */
            if( __HAL_ETH_WAKEUP_EXTI_GET_FLAG( ETH_WAKEUP_EXTI_LINE ) != ( uint32_t ) RESET )
            {
                ...
            }

I did not try to modify the code yet.

@kanherea At least last time, the code got stuck in the ISR again, and I could not recover the device (no matter how often I stopped or resumed the target). Thanks for the snippet. I will insert it and come back in a few days, once the error has occurred again.

Hello @Dweb_2, is there any news about your issue?

I also have a STM32H7, and I wonder if I can simulate the problem.
Should I just exchange 0x80000000 UDP messages with the device? Are all packets answered by the device? What is the size of the packets?

Hi @htibosch,
the DUT is currently running; it will take until approximately next week to transfer the same number of packets.

At least from my observations and current assumptions, it will take 0x80000000 UDP packets. Average size of messages (transmitted and received) in my case is in the range of 100B.

In my test setup I tracked a very small packet drop of 0.0006%. I tried to investigate this, and the network stack definitely receives these messages. However, there seems to be an issue with the UDP CRC from the remote host at this point here.

It’s most likely the fact that the network stack of my test script or the PC used for sending is old (Windows 7 era). Maybe some low level settings on the network interface card are also different. I never observed this on more recent machines/hardware.

You can mask ( disable ) all counter interrupts by setting these interrupt mask registers:

ETH->MMCRIMR =
    ETH_MMCRIMR_RXLPITRCIM |
    ETH_MMCRIMR_RXLPIUSCIM |
    ETH_MMCRIMR_RXUCGPIM |
    ETH_MMCRIMR_RXALGNERPIM |
    ETH_MMCRIMR_RXCRCERPIM;

ETH->MMCTIMR =
    ETH_MMCTIMR_TXLPITRCIM |
    ETH_MMCTIMR_TXLPIUSCIM |
    ETH_MMCTIMR_TXGPKTIM |
    ETH_MMCTIMR_TXMCOLGPIM |
    ETH_MMCTIMR_TXSCOLGPIM;

Here is a way to simulate the overflow problem without waiting for a week!

    /* Select the preset level: either almost half-count (bit clear), or almost full-count (bit set). */
    ETH->MMCCR &= ~( ETH_MMCCR_CNTPRSTLVL );
    ETH->MMCCR |= ETH_MMCCR_CNTPRST;

After these instructions, all counters are almost half-way. Within a few seconds you will see an overflow, without the interrupt.

Good luck

PS. I will change the STM32Hxx driver to disable interrupts from the packet and byte counters.

You also mentioned a small packet loss. I am not sure where this can happen. Do you miss incoming packets ( to the DUT ), or outgoing ones ( from the DUT )?

There could be an under- or overflow in the EMAC/PHY, rare collisions on the LAN, or a lack of network buffers ( easy to check ).

Or the UDP checksum:

As you probably know, a UDP checksum of zero on the wire means that the checksum has not been set; any other value is a real checksum. For that reason, a computed checksum of 0x0000 must be inverted to 0xffff before sending.
The chance of computing a checksum of 0x0000 is about 1 / 0x10000, or 0.0015 %, which is a bit more than your calculated packet drop of 0.0006 %.
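The substitution rule can be sketched in a few lines of C. This is only an illustration of the one's-complement checksum and the 0x0000 to 0xffff replacement, not the code used by FreeRTOS+TCP or by the PC's network stack; the function names are made up, and the caller is assumed to pass the pseudo-header, UDP header (with a zeroed checksum field) and payload as one buffer of at most 64 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over big-endian 16-bit words, folded to 16 bits.
 * An odd trailing byte is padded with a zero byte. */
static uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    if (i < len) {
        sum += (uint32_t)data[i] << 8;
    }
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16); /* fold the carries back in */
    }
    return (uint16_t)sum;
}

/* Value to place in the UDP checksum field: a computed 0x0000 is sent
 * as 0xFFFF, because 0x0000 on the wire means "no checksum". */
static uint16_t udp_wire_checksum(const uint8_t *data, size_t len)
{
    uint16_t checksum = (uint16_t)~ones_complement_sum(data, len);

    return (checksum == 0x0000u) ? 0xFFFFu : checksum;
}
```

A sender that skips the last step would emit 0x0000 for roughly one packet in 65536, and a receiver that requires UDP checksums would drop exactly those packets.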

Does your program reach this log message?
"prvAllowIPPacket: UDP packet from %xip without CRC dropped"?

Thanks for the explanations. I will try it out! Also looking forward to using your modified STM32Hxx drivers.

The checksum was wrong, which is why I observed some packet drop (from the DUT). The log message you mentioned was printed by the stack. I inspected the raw ETH frame in RAM, and it looked fine, except for the checksum of 0x0000.

Also looking forward to using your modified STM32Hxx drivers.

Well, it is nothing surprising:

#346 IPv4/single: disable counter interrupts in the STM32Hxx driver

#347 IPv6/multi: disable counter interrupts in the STM32Hxx driver

I added the following to HAL_ETH_Init() :

/* Disable the interrupts that are related to the MMC counters. */

heth->Instance->MMCRIMR =
    ETH_MMCRIMR_RXLPITRCIM |  /* RXLPITRC */
    ETH_MMCRIMR_RXLPIUSCIM |  /* RXLPIUSC */
    ETH_MMCRIMR_RXUCGPIM |    /* RXUCASTG */
    ETH_MMCRIMR_RXALGNERPIM | /* RXALGNERR */
    ETH_MMCRIMR_RXCRCERPIM;   /* RXCRCERR */

heth->Instance->MMCTIMR =
    ETH_MMCTIMR_TXLPITRCIM | /* TXLPITRC */
    ETH_MMCTIMR_TXLPIUSCIM | /* TXLPIUSC */
    ETH_MMCTIMR_TXGPKTIM |   /* TXPKTG */
    ETH_MMCTIMR_TXMCOLGPIM | /* TXMULTCOLG */
    ETH_MMCTIMR_TXSCOLGPIM;  /* TXSNGLCOLG */

and I tested by putting the counters at “almost half” value.