My ISR is adding a message to a task's message queue, but the task doesn't start for 80usec...?

My ISR is adding a message to a task’s message queue, and then using portYIELD_FROM_ISR() to wake the task up. After the ISR exits, the Scheduler fires up and wakes the task up properly. However, what is completely unexpected is that the task then waits (or appears to wait) for around 70-90usec before actually reading stuff from its message queue. (And it’s doing this every time.)

Here’s what it looks like in Segger SystemView (you can see the orange ISR waking up the turquoise task (“pxHigherPriorityTaskWoken”), but the task stays pending until the Scheduler actually runs it):

[SystemView screenshot]

I’m going crazy here, because I just can’t figure out what the task is doing in that 70-90usec gap between the scheduler starting the task and the task actually reading from its message queue. (As far as I can see, there should be no gap there at all.)

Any suggestions as to what might be happening here? It’s just an ISR and a task: the task is simply waiting on its message queue when the ISR gets triggered, so there should be no additional task interactions. I haven’t found any remotely similar problem on this forum. :frowning:

(This is running on an i.MX RT1062.)

I’m not sure how to read that trace, as it appears to show the task blocking after 81.2us, so presumably it has done two queue reads and one queue write in the meantime; but the queue reads and writes are not shown in the graphic, only in the text. Am I misreading?

Hi Richard,

What’s happening in the diagram is that the (orange) ISR is posting a message to the (cyan) tcpip_thread’s message queue (readying the tcpip_thread, “Task Ready”), and then exiting (“Isr Exit”). The next context to run is the Scheduler (white, in the middle), which then also exits (“ISR Exit”), starting the (cyan) tcpip_thread (“Task Run”).

The tcpip thread then pauses (somewhere, somehow) for 73usec before grabbing the message from the queue (the first “xQueueGenericReceive”). Because FreeRTOS mutexes and semaphores are implemented on top of queues, locking and unlocking the tcpip mutex then appears as “xQueueGenericReceive” and “xQueueGenericSend”. So the actual work being done by the tcpip task to process the message takes about 2usec (between the lock and unlock). Note that the three queue actions appear as three tiny vertical lines towards the right-hand end of the cyan task bar.

So, my question is what is going on for the 73usec between the tcpip thread waking up (“Task Run”) and the tcpip_thread actually getting the message off its message queue (“xQueueGenericReceive”).

From my perspective, it looks as though the tcpip_thread might be waking up in a WFI or WFE state, and is then spending some time coming out of sleep mode before actually executing. That sounds a bit mad, but the only other mechanism I can think of that might be holding the processor back is a hardware fault causing continual interrupts (which sounds even madder).

So… what gcc __attribute__(()) should I be using for IRQ handlers, in terms of ((irq)), ((naked)), ((interrupt)) or whatever? I can imagine that if an IRQ handler gets the processor status wrong on exit, it might cause all manner of practical mayhem, such as unexpected sleeping.
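For reference, my handlers are currently plain C functions along these lines (a minimal sketch: the handler name, queue handle and payload are illustrative). My understanding is that on Cortex-M no attribute should strictly be needed, because the hardware stacks the caller-saved registers itself on exception entry:

/* Illustrative ISR: post to the task's queue, then request a context switch. */
void MyDevice_IRQHandler( void )
{
	BaseType_t xHigherPriorityTaskWoken = pdFALSE;
	uint32_t ulMsg = 0;

	xQueueSendFromISR( xMsgQueue, &ulMsg, &xHigherPriorityTaskWoken );

	/* If posting woke a higher-priority task, pend a context switch on exit. */
	portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}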

Thanks very much!, Nick

Assuming that the tcpip_thread is the lwIP thread, it is likely blocked in this function in one of the xQueueReceive calls. Is it possible to emit an event right after those xQueueReceive calls, so that it shows up on this plot? That would tell us whether the time is being spent inside xQueueReceive or after it returns.
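Something along these lines, perhaps (a sketch; the marker index and variable names are illustrative):

/* Time-stamp the moment the blocking receive returns. */
xQueueReceive( mbox, &msg, portMAX_DELAY );
SEGGER_SYSVIEW_MarkStart( MARKERINDX_AFTER_RECEIVE );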

Hi Gaurav,

I’ve added some extra markers, just so you can see exactly what I’m seeing:

[SystemView screenshot: marker pairs around vTaskSwitchContext() on the left and around the first xQueueGenericReceive on the right]

On the left, you can see a pair of markers wrapped around vTaskSwitchContext(): this is called by xPortPendSVHandler() in portable/GCC/ARM_CM4F/port.c, i.e. the PendSV ISR handler. The code either side of that is just plain linear ARM assembly, pushing and popping registers to do the context switch. So this is happening exactly when the tcpip_thread task is being switched in, just as you’d hope.

On the right, wrapped either side of the first small vertical line (the xQueueGenericReceive) is a second pair of markers I wrapped right at the top of the loop in xQueueReceive():

for( ;; )
{
	taskENTER_CRITICAL();
	{
		SEGGER_SYSVIEW_MarkStart(MARKERINDX_EVENT_1);

		const UBaseType_t uxMessagesWaiting = pxQueue->uxMessagesWaiting;

What has been called immediately before that is portYIELD_WITHIN_API(), which just sets the PendSV flag, triggering the PendSV IRQ Handler.
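For reference, on this port portYIELD_WITHIN_API() boils down to the following (from portable/GCC/ARM_CM4F/portmacro.h, lightly trimmed):

#define portYIELD()										\
{														\
	/* Set a PendSV to request a context switch. */		\
	portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;		\
	__asm volatile ( "dsb" ::: "memory" );				\
	__asm volatile ( "isb" );							\
}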

In summary, I can see the PendSV handler executing OK (at the left) and I can see the tcpip_thread executing OK (on the right), but I just can’t work out what is happening in the 70-90usec between the two.

Part of me wonders whether this is a Cortex M7 chip bug, i.e. whether setting the PendSV bit is somehow placing the (thread-mode) core in a WFE state (or something like it), which is then being resumed into once the PendSV code exits. (PendSV executes in privileged Handler mode, so in principle it could mangle thread-mode state on the way out.) If this is right, then the workaround would be for the PendSV handler to clear that state before exiting. But… that sounds a bit crazy.

Any suggestions?

Thanks, Nick

I assume you added these events in the xPortPendSVHandler, right?

Can you try to step through the assembly and see where control goes when this line executes? Then I’d put an event right at that line, in addition to the one you added in xQueueReceive.

The “bx r14” / “bx lr” line jumps to:

[Disassembly screenshot: the branch target, two opcodes after the portYIELD_WITHIN_API() call]

i.e. two opcodes after the portYIELD_WITHIN_API() call completed.

The branch at the bottom of this sequence (the “b.n”) jumps to the top of the loop I cut and pasted above. The first thing there is a call to vPortEnterCritical(), which compiles down to about 20 opcodes at -O0; and then there’s the Segger marker I added in. If I count the opcodes between the bx r14 and the Segger marker, it’s no more than 30, which at 600MHz should take on the order of 50ns.

And yet it seems to be taking 70+ usec. :frowning:

I added another pair of markers around the taskENTER_CRITICAL() call at the top of the loop; these show it taking 0.623usec (almost all of which is Segger SystemView overhead):

for( ;; )
{
	SEGGER_SYSVIEW_MarkStart(MARKERINDX_EVENT_2);
	taskENTER_CRITICAL();
	SEGGER_SYSVIEW_MarkStop(MARKERINDX_EVENT_2);
	{
		SEGGER_SYSVIEW_MarkStart(MARKERINDX_EVENT_1);

I’ve highlighted these with the blue line on the right, so these are happening after the 70-90 usec delay.

[SystemView screenshot: the new marker pair (highlighted with the blue line on the right) appearing after the 70-90usec delay]

Hi @nickpelling ,

If this is not helpful I apologise for the noise :slightly_smiling_face:
I’m no expert on ARM assembly, but I believe the DSB and ISB instructions are probably the issue, as they can potentially take a while to complete depending on the state of the data and instruction caches.

See this link for the instruction description.

I can imagine these operations might take a while.

Kind Regards,

Pete

Right, they may take a while to complete, but I don’t think they take tens of microseconds.
It’s really strange… and given that you have the caches enabled, it’s also unlikely that a lengthy code fetch from flash would hit you systematically at exactly that point :thinking:
BTW do you run from internal flash or RAM? Also, do you have any interrupts/ISRs running that are not covered by FreeRTOS?

Pete Bone: it’s a good question, thanks! In fact, in one of my tests I tried not writing to memory in the preceding ISR to see if there was any kind of dsb-related slowdown, and it made no measurable difference at all. :frowning:

Also: just in case it was a problem with context switching while the processor was executing a dsb, I added a load of nops to the yield macro (between the write to the NVIC PendSV bit and the dsb/isb). From that, it seems that the actual context switch is typically happening ~12 cycles after the write to the NVIC PendSV bit. But that made no difference to the timing behaviour.
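The experiment looked roughly like this (a sketch; the nop count is illustrative):

/* Pad between the PendSV trigger and the barriers, so the context switch cannot land mid-dsb. */
portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;
__asm volatile ( "nop \n nop \n nop \n nop \n nop \n nop \n nop \n nop" );
__asm volatile ( "dsb" ::: "memory" );
__asm volatile ( "isb" );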

Hartmut Schaefer: I’m executing all this code from ITCM, so I’m not expecting any QSPI-related slowdown here. I went to some effort to wrap Segger SystemView Enter/Exit Isr markers around all my ISR handlers, so I’d be a little bit surprised if there’s a stray ISR here. But it’s still possible I’ve missed something (though I can’t see where). :frowning:

Incidentally, one mysterious coincidence is that when the SystemTick fires, it always happens immediately at the end of the tcpip_thread’s delay period, pretty much exactly at the time you’d expect to see something else waking up the processor. But… this makes no obvious sense at all.

[SystemView screenshot: the tick interrupt firing right at the end of the tcpip_thread’s delay period]

If this isn’t a coincidence, then it seems that there are two mysterious things here: firstly, a task that is sleeping after a context switch, and secondly something inside the PIT (i.e. not just the SysTick) waking the processor up from that sleep. But as I say, this makes no obvious sense.

Hi @hs2 ,

If the data originated from the network interface, is it not possible (assuming the likely scenario that it came from a DMA interface) that the DMA operation was followed by a cache invalidate, and the DSB instruction is waiting for that invalidation to complete? This is a hypothesis based on not knowing the ins and outs of the Cortex M7 architecture. :blush:

Just a thought…

cc @nickpelling

Kind Regards,

Pete

Pete Bone: another good question! Alas, in this instance, the device isn’t receiving lots of packets, but rather sampling several channels of input data and then trying to publish that in frequent small packets (i.e. 4000 per second).

Which is why a 70-90usec delay per context switch quickly mounts up, particularly when you’d probably only expect 2-4usec. :frowning:

Now, I can’t help but suspect that the problem I’m looking at is to do with the Cortex M7 unexpectedly being in a sleep state. So I looked once again at the ARMv7-M Architecture Reference Manual.

Section B1.5.17 (“Power management”) mentions the SLEEPONEXIT and SLEEPDEEP bits (bits 1 and 2 of SCB->SCR respectively). These control when the processor automatically drops into sleep; for SLEEPONEXIT, for example:

This configuration means that, on an exception return, if no exception other than the returning exception is active, the processor suspends execution without returning from the exception.

However, when I looked at the contents of the SCR register, all these bits were set to 0 already. So I’m still out of ideas as to what is causing this problem. :frowning:
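(For the record, the check amounts to this, using the CMSIS register names:)

/* SLEEPONEXIT = bit 1, SLEEPDEEP = bit 2 of SCB->SCR. */
configASSERT( ( SCB->SCR & ( SCB_SCR_SLEEPONEXIT_Msk | SCB_SCR_SLEEPDEEP_Msk ) ) == 0 );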

Do you have access to a reasonably compatible M4 machine that you can run the same minimal application against?

One way to check for a sleep problem is to disable any code that can enter the sleep modes or set those bits.

RAc: sadly, I only have access to M7 devices here. :frowning:

richard-damon: I’ve already looked for all instances of WFI / WFE / SEV in the code, and found none, alas.

I’m now wondering whether the problem might be to do with the i.MX RT106x processor itself, because it has specific integrated logic devoted to low-power operation (e.g. CCM->CCR, CCM->CLPCR, PMU->MISC0). That is, something could be causing the chip to think it needs to go into WAIT mode or STOP mode, where the CPU clock gets gated off for a while.

So, I’m now raking through the RT106x reference manual to see if there’s any way I can force it to stay in RUN mode in all situations. It’s a bit of a long shot, but what’s a dev guy to do, eh?
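If I do find a culprit, I’d guess the workaround will look something like this (a sketch using the NXP device-header names; I haven’t verified it yet):

/* Force the low-power mode field to RUN (LPM = 00) so WAIT/STOP cannot be entered. */
CCM->CLPCR = ( CCM->CLPCR & ~CCM_CLPCR_LPM_MASK ) | CCM_CLPCR_LPM( 0U );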

Hi @nickpelling ,

Again if this is not relevant please ignore :slight_smile:

If you have access to an oscilloscope…
Is there any way you can toggle a hardware GPIO pin either side of the dsb and isb instructions, with a temporary modification to the source code, so you can measure the time taken with an oscilloscope on the GPIO pin? That would at least eliminate those instructions from your enquiries…
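Something like this would do it (a sketch: the port, pin and setup are illustrative, and assume the pin is already configured as an output):

GPIO1->DR_SET = ( 1U << 3 );	/* pin high (the i.MX RT GPIO has atomic set/clear registers) */
__asm volatile ( "dsb" ::: "memory" );
__asm volatile ( "isb" );
GPIO1->DR_CLEAR = ( 1U << 3 );	/* pin low */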

This is the next thing I would try…

Kind Regards,

Pete


Pete Bone: that wouldn’t be easy with the setup I have, but if I add a load of nops to move it further down, I should be able to use DWT->CYCCNT to time dsb/isb. I’ll give that a go now…
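Roughly like this, I think (a sketch using the CMSIS names; at 600MHz, 600 cycles is 1usec):

CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;	/* enable the DWT unit */
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;			/* start the cycle counter */

uint32_t ulBefore = DWT->CYCCNT;
__asm volatile ( "dsb" ::: "memory" );
__asm volatile ( "isb" );
uint32_t ulCycles = DWT->CYCCNT - ulBefore;		/* cycles spent in dsb/isb */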
