I added more traces to the code, and now have a better explanation of the issue.
ITM1 traces xTickCount in xTaskIncrementTick() before vApplicationTickHook() is called. If the scheduler is suspended, the highest byte is uxPendedTicks and the lower 16 bits are xTickCount; when the scheduler is resumed, the 2nd highest byte is uxPendedTicks.
ITM2 traces three values: the adjusted xTickCount inside vTaskStepTick() when the idle task exits the low power sleep mode; the SysTick reload register value before the idle task enters sleep; and the SysTick current counter value after the idle task has exited sleep.
ITM3 traces all interrupts of my own code.
ITM4 traces the external interrupt (0x00000001), the call to ulTaskNotifyTake() (0x00000002), and ulTaskNotifyTake() exiting prematurely.
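To make the ITM1 layout concrete, here is a minimal sketch of how such a word could be packed and decoded. The names Itm1Fields, itm1_pack_suspended and itm1_decode are hypothetical and only for illustration; they are not the actual trace code. I am also assuming that the low 16 bits carry xTickCount in both the suspended and the resumed case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of one ITM1 trace word (illustration only). */
typedef struct {
    uint8_t  pended_if_suspended; /* highest byte: pended ticks, scheduler suspended */
    uint8_t  pended_if_resumed;   /* 2nd highest byte: pended ticks, scheduler resumed */
    uint16_t tick_low16;          /* lower 16 bits of xTickCount */
} Itm1Fields;

/* Pack the word traced while the scheduler is suspended. */
static uint32_t itm1_pack_suspended(uint32_t pended_ticks, uint32_t tick_count)
{
    assert(pended_ticks <= 0xFFu); /* must fit in one byte */
    return (pended_ticks << 24) | (tick_count & 0xFFFFu);
}

/* Split a traced word back into its fields. */
static Itm1Fields itm1_decode(uint32_t word)
{
    Itm1Fields f;
    f.pended_if_suspended = (uint8_t)(word >> 24);
    f.pended_if_resumed   = (uint8_t)((word >> 16) & 0xFFu);
    f.tick_low16          = (uint16_t)(word & 0xFFFFu);
    return f;
}
```

With this layout, the trace value 0x01001B32 decodes as one pended tick with the scheduler suspended and xTickCount ending in 0x1B32.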
Here is a capture of two interrupt cycles. The first one is normal, and the second is the one that has the issue:
After my work task finished its job, it called ulTaskNotifyTake(timeout=5) to wait for the next interrupt. The xTickCount was 0x1B2F, so the timeout should expire at 0x1B34. The SysTick reload register was set to 48000*4 plus the remainder of the previous SysTick register value. The next interrupt woke up the idle task before the SysTick timed out. The idle task adjusted the xTickCount to 0x1B32 and set the SysTick timeout to the remaining fraction of a single tick, based on the SysTick current counter value. Everything was good so far. After finishing the job, the work task called ulTaskNotifyTake() to wait again. This time, right before the idle task entered sleep, the SysTick interrupt fired.

The SysTick value was 0x143B7 (82871 counts), which is one full tick of 48000 counts plus a fraction of 34871 counts, about 720us.
Because the idle task had suspended the scheduler, the SysTick ISR (trace 0x01001B32) was not able to increment the xTickCount; instead it incremented the xPendedTicks.
The idle task entered sleep mode and set the SysTick timeout to 0x23253, about 3 ticks later.
Why was it 3? There could be another task that needs to wake up earlier than the work task.
Here is the snippet from the idle task (portTASK_FUNCTION); the scheduler is suspended before entering sleep mode:
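The conversion from SysTick counts to ticks can be checked with a small sketch. It takes 48000 counts per tick from the reload value above and assumes a 1 ms tick period; TickSplit and split_systick_counts are hypothetical names, not part of FreeRTOS or the port layer.

```c
#include <assert.h>
#include <stdint.h>

/* Result of splitting a SysTick count into whole ticks and a remainder. */
typedef struct {
    uint32_t whole_ticks;
    uint32_t remainder_counts;
    uint32_t remainder_us;
} TickSplit;

/* counts_per_tick: 48000 in this capture; tick_period_us: assumed 1000. */
static TickSplit split_systick_counts(uint32_t counts,
                                      uint32_t counts_per_tick,
                                      uint32_t tick_period_us)
{
    TickSplit s;
    assert(counts_per_tick > 0u);
    s.whole_ticks      = counts / counts_per_tick;
    s.remainder_counts = counts % counts_per_tick;
    /* 64-bit intermediate avoids overflow in the multiplication. */
    s.remainder_us = (uint32_t)(((uint64_t)s.remainder_counts * tick_period_us)
                                / counts_per_tick);
    return s;
}
```

For 0x143B7 this gives 1 whole tick plus 34871 counts (726 us). For 0x23253 (143955 counts) it gives 2 whole ticks plus 47955 counts, which is only 45 counts short of 3 full ticks, hence "about 3 ticks".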
static portTASK_FUNCTION( prvIdleTask, pvParameters )
…
    xExpectedIdleTime = prvGetExpectedIdleTime();
    if( xExpectedIdleTime >= configEXPECTED_IDLE_TIME_BEFORE_SLEEP )
    {
        vTaskSuspendAll();
        {
            /* Now the scheduler is suspended, the expected idle
            time can be sampled again, and this time its value can
            be used. */
            configASSERT( xNextTaskUnblockTime >= xTickCount );
            xExpectedIdleTime = prvGetExpectedIdleTime();
            if( xExpectedIdleTime >= configEXPECTED_IDLE_TIME_BEFORE_SLEEP )
            {
                traceLOW_POWER_IDLE_BEGIN();
                portSUPPRESS_TICKS_AND_SLEEP( xExpectedIdleTime );
                traceLOW_POWER_IDLE_END();
            }
            else
            {
                mtCOVERAGE_TEST_MARKER();
            }
        }
        ( void ) xTaskResumeAll();
    }
After 3 ticks, the SysTick interrupt fired. Because the idle task had suspended the scheduler, the SysTick ISR had to increment the xPendedTicks again, and xPendedTicks became 2.
HERE IS the issue:
The idle task then exited sleep, saw that 3 ticks had elapsed on the SysTick, and called vTaskStepTick() to advance the xTickCount by 3. However, the xPendedTicks was still 2 after the adjustment. After xTaskResumeAll(), those 2 pended ticks were added to xTickCount AGAIN.
This explains why the xTickCount increased by 5 while the true elapsed time was about 3ms.
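The sequence can be replayed with a simplified model. This is not FreeRTOS code: TickModel and the model_* functions are hypothetical stand-ins for xTaskIncrementTick(), vTaskStepTick() and xTaskResumeAll(), and they keep only the bookkeeping that matters here, assuming the step function never touches the pended count.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the kernel tick state (illustration only). */
typedef struct {
    uint32_t tick_count;          /* models xTickCount */
    uint32_t pended_ticks;        /* models xPendedTicks */
    int      scheduler_suspended;
} TickModel;

/* Tick ISR: pend the tick while the scheduler is suspended. */
static void model_tick_isr(TickModel *m)
{
    if (m->scheduler_suspended) {
        m->pended_ticks++;
    } else {
        m->tick_count++;
    }
}

/* Step the tick count after sleep; the pended count is left untouched. */
static void model_step_tick(TickModel *m, uint32_t ticks)
{
    m->tick_count += ticks;
}

/* Resume the scheduler and unwind every pended tick. */
static void model_resume_all(TickModel *m)
{
    m->scheduler_suspended = 0;
    m->tick_count += m->pended_ticks;
    m->pended_ticks = 0u;
}

/* Replays the second interrupt cycle; returns the total tick advance. */
static uint32_t replay_reported_sequence(void)
{
    TickModel m = { 0x1B32u, 0u, 0 };
    m.scheduler_suspended = 1;   /* idle task suspends the scheduler */
    model_tick_isr(&m);          /* SysTick fires right before sleep */
    model_tick_isr(&m);          /* SysTick fires again at the end of sleep */
    assert(m.pended_ticks == 2u);
    model_step_tick(&m, 3u);     /* idle task steps the count by 3 */
    model_resume_all(&m);        /* the 2 pended ticks are added on top */
    return m.tick_count - 0x1B32u;
}
```

In this model replay_reported_sequence() returns 5, matching the capture: 3 ticks from the step plus 2 pended ticks that cover the same interval.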