My ISR is adding a message to a task’s message queue, and then using portYIELD_FROM_ISR() to wake the task up. After the ISR exits, the Scheduler fires up and wakes the task up properly. However, what is completely unexpected is that the task then waits (or appears to wait) for around 70-90usec before actually reading stuff from its message queue. (And it’s doing this every time.)
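For context, the ISR follows the usual FreeRTOS from-ISR pattern; here's a minimal sketch of what it's doing (the handler name, queue handle and message payload are placeholders, not my actual code):

```c
#include "FreeRTOS.h"
#include "queue.h"

extern QueueHandle_t xMsgQueue;   /* the task's message queue (placeholder) */

void EXAMPLE_IRQHandler( void )   /* placeholder handler name */
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    uint32_t ulMsg = 0;           /* placeholder message payload */

    /* Post the message to the waiting task's queue. */
    xQueueSendFromISR( xMsgQueue, &ulMsg, &xHigherPriorityTaskWoken );

    /* If that unblocked a higher-priority task, request a context
     * switch so it runs as soon as the ISR exits. */
    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}
```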
Here’s what it looks like in Segger SystemView (you can see the orange ISR waking up the turquoise task (“pxHigherPriorityTaskWoken”), but the task stays pending until the Scheduler actually runs it):
I’m going crazy here, because I just can’t figure out what the task is doing in that 70-90usec gap between the scheduler starting the task and the task actually reading from its message queue. (As far as I can see, there should be no gap there at all.)
Any suggestions as to what might be happening here? It’s an ISR and a task, the task is just waiting on its message queue when the ISR gets triggered, so there should be no additional task interactions. I haven’t found any remotely similar problem on this forum.
I’m not sure how to read that trace, as it appears to show the task blocking after 81.2us, so presumably it has done two queue reads and one queue write in the meantime, but the queue reads and writes are not shown in the graphic - only in the text. Am I misreading?
What’s happening in the diagram is that the (orange) ISR is posting a message to the (cyan) tcpip_thread’s message queue (starting up the tcpip_thread, “Task Ready”), and then exiting (“Isr Exit”). The next thread to wake up is the scheduler (white, in the middle), which then also exits (“ISR Exit”) while starting the (cyan) tcpip_thread (“Task Run”).
The tcpip thread is then pausing (somewhere, somehow) for 73 usec, before grabbing the message from the queue (the first “xQueueGenericReceive”). Because semaphores and mutexes are implemented as queues in FreeRTOS, locking and unlocking the tcpip mutex then appears as “xQueueGenericReceive” and “xQueueGenericSend”. So the actual work being done by the tcpip task to process the message is taking about 2usec (between the lock and unlock). Note that the three queue actions appear as three tiny vertical lines towards the right-hand end of the cyan task bar.
So, my question is what is going on for the 73usec between the tcpip thread waking up (“Task Run”) and the tcpip_thread actually getting the message off its message queue (“xQueueGenericReceive”).
From my perspective, it appears that the tcpip_thread might be waking up in a WFI or WFE state, and is then spending some time coming out of sleep mode before actually executing. Though that sounds a bit mad, the only other mechanism I can think of that might be holding the processor back would be a hardware fault causing continual interrupts (which sounds even madder).
So… what gcc __attribute__(()) should I be using for IRQ handlers? In terms of __attribute__((irq)), __attribute__((naked)), __attribute__((interrupt)) or whatever. I can imagine that if an IRQ handler gets the processor status wrong on exit, it might cause all manner of practical mayhem, such as unexpected sleeping etc.
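For what it's worth, on Cortex-M the exception entry hardware stacks r0-r3, r12, lr, pc and xPSR itself, so an ordinary AAPCS C function is already a valid handler under arm-none-eabi-gcc and no attribute should be required; ((naked)) would actually be dangerous unless you hand-write the whole prologue/epilogue. A sketch (the handler name is a placeholder from a typical vector table):

```c
/* Sketch: on Cortex-M, handlers can be plain C functions because the
 * hardware stacks the caller-saved registers on exception entry and
 * restores them on the EXC_RETURN.  No attribute needed on M-profile
 * (the ((interrupt("IRQ"))) attribute is aimed at A/R-profile cores). */
void DMA0_IRQHandler( void )   /* placeholder handler name */
{
    /* Ordinary C code; the compiler treats this as a normal function. */
}
```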
Assuming that the tcpip_thread is the LWIP thread, it is likely blocked in this function in one of the xQueueReceive calls. Is it possible to emit an event right after these xQueueReceive calls which will be visible on this plot? That would tell us whether the time is spent inside xQueueReceive or after it returns.
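Something along these lines inside the tcpip_thread loop, perhaps (a sketch only; the marker ID is arbitrary, and I'm assuming the stock lwIP tcpip.c structure where the blocking fetch ends in xQueueReceive):

```c
#include "SEGGER_SYSVIEW.h"

/* Sketch: bracket the blocking mailbox fetch in tcpip_thread with
 * SystemView markers so the unblock instant is visible on the plot. */
SEGGER_SYSVIEW_MarkStart( 1 );                    /* marker ID 1: arbitrary */
TCPIP_MBOX_FETCH( &tcpip_mbox, ( void ** ) &msg ); /* ends in xQueueReceive */
SEGGER_SYSVIEW_MarkStop( 1 );                     /* fires right after unblock */
```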
I’ve added some extra markers, just so you can see exactly what I’m seeing:
On the left, you can see a pair of markers wrapped around vTaskSwitchContext(): this is called by xPortPendSVHandler() in portable/GCC/ARM_CM4F/port.c, i.e. the PendSV ISR handler. The code either side of that is just plain linear ARM assembly code, pushing and popping registers to do the context switch. So this is happening exactly when the tcpip_thread task is being switched in, just as you’d hope.
On the right, wrapped either side of the first small vertical line (the xQueueGenericReceive) is a second pair of markers I wrapped right at the top of the loop in xQueueReceive():
What has been called immediately before that is portYIELD_WITHIN_API(), which just sets the PendSV flag, triggering the PendSV IRQ Handler.
In summary, I can see the PendSV handler executing OK (at the left) and I can see the tcpip_thread executing OK (on the right), but I just can’t work out what is happening in the 70-90usec between the two.
Part of me wonders whether this is a Cortex M7 chip bug, i.e. whether setting the PendSV bit is somehow deliberately placing the (non-supervisor) registers in a WFE state (or something like it), which is then being resumed into once the PendSV code exits. (I’m presuming that the PendSV code executes in Supervisor mode, but I may be wrong.) If this is right, then the workaround would be for the PendSV handler to clear that task’s status bit before exiting. But… that sounds a bit crazy.
i.e. two opcodes after the portYIELD_WITHIN_API() call completed.
The branch at the bottom of this sequence (the “b.n”) jumps to the top of the loop I cut and pasted above. The first thing there is a call to vPortEnterCritical(), which compiles down to about 20 opcodes at -O0; and then there’s the Segger marker I added in. If I count the opcodes between the bx r14 and the Segger marker, it’s no more than 30, which at 600MHz amounts to mere tens of nanoseconds.
If this is not helpful, I apologise for the noise.
I’m no expert on ARM assembly, but I believe the DSB and ISB instructions are probably the issue, as they can potentially take a while to complete depending on the state of the data and instruction caches.
Right, they may take a while to complete but I don’t think they take N*10 us.
It’s really strange… also, given that you have caches enabled, it’s unlikely that a lengthy code fetch from flash hits you exactly there, systematically.
BTW, do you run from internal flash or RAM? Also, do you have any interrupts/ISRs running that aren’t covered by FreeRTOS?
Pete Bone: it’s a good question, thanks! In fact, in one of my tests I tried not writing to memory in the preceding ISR to see if there was any kind of dsb-related slowdown, and it made no measurable difference at all.
Also: just in case it was a problem with context switching while the processor was executing a dsb, I added a load of nops to the yield macro (between the write to the NVIC PendSV bit and the dsb/isb). From that, it seems that the actual context switch is typically happening ~12 cycles after the write to the NVIC PendSV bit. But that made no difference to the timing behaviour.
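For clarity, the experiment modified the port’s yield macro roughly like this (a sketch based on the stock ARM_CM4F portmacro.h definition; the NOP count shown is illustrative):

```c
/* Sketch of the experiment: pad between the ICSR write and the
 * barriers with NOPs, to see how soon the PendSV exception actually
 * preempts after the write to the NVIC PendSV-set bit. */
#define portYIELD()                                          \
{                                                            \
    /* Set the PendSV-set bit in the NVIC ICSR register. */  \
    portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;          \
    __asm volatile( "nop\n nop\n nop\n nop\n"                \
                    "nop\n nop\n nop\n nop\n"                \
                    "nop\n nop\n nop\n nop\n" );             \
    __asm volatile( "dsb" ::: "memory" );                    \
    __asm volatile( "isb" );                                 \
}
```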
Hartmut Schaefer: I’m executing all this code from ITCM, so I’m not expecting any QSPI-related slowdown here. I went to some effort to wrap Segger SystemView Enter/Exit Isr markers around all my ISR handlers, so I’d be a little bit surprised if there’s a stray ISR here. But it’s still possible I’ve missed something (though I can’t see where).
Incidentally, one mysterious coincidence is that when the SysTick fires, it always happens immediately at the end of the tcpip_thread’s delay period, pretty much exactly at the time you’d expect to see something else waking up the processor. But… this makes no obvious sense at all.
If this isn’t a coincidence, then it seems that there are two mysterious things here: firstly, a task that is sleeping after a context switch, and secondly something inside the PIT (i.e. not just the SysTick) waking the processor up from that sleep. But as I say, this makes no obvious sense.
If the data originated from the network interface, is it not possible (assuming the likely scenario, that it came from a DMA interface) that the DMA operation was followed by a cache invalidate, and the DSB instruction might be waiting on the invalidation to complete? This is a hypothesis based on not knowing the ins and outs of the Cortex M7 architecture.
Pete Bone: another good question! Alas, in this instance, the device isn’t receiving lots of packets, but rather sampling several channels of input data and then trying to publish that in frequent small packets (i.e. 4000 per second).
Which is why a 70-90usec delay per context switch quickly mounts up, particularly when you’d probably only expect 2-4usec.
richard-damon: I’ve already looked for all instances of WFI / WFE / SEV in the code, and found none, alas.
I’m now wondering whether the problem might be to do with the i.MX RT106x processor, because this has specific integrated logic devoted to low-power operation, e.g. CCM->CCR, CCM->CLPCR, PMU->MISC0. That is, it could be that something is causing the chip to think that it needs to go into WAIT mode or STOP mode, where the cpu clock gets gated off for a while.
So, I’m now raking through the RT106x reference manual to see if there’s any way I can force it to stay in RUN mode in all situations. It’s a bit of a long shot, but what’s a dev guy to do, eh?
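If it helps anyone following along, the first thing I’m going to try looks roughly like this (a sketch, untested; the register and field names are taken from the NXP device headers, and the header name is an assumption):

```c
#include "fsl_device_registers.h"   /* NXP i.MX RT106x device headers (assumed) */

/* Sketch (untested): pin the CCM low-power mode selection to RUN, so
 * the ARM core clock is never gated off when the core executes a WFI.
 * LPM = 0b00 selects RUN mode in CCM_CLPCR. */
static void force_run_mode( void )
{
    CCM->CLPCR = ( CCM->CLPCR & ~CCM_CLPCR_LPM_MASK ) | CCM_CLPCR_LPM( 0U );
}
```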
If you have access to an oscilloscope…
Is there any way you can toggle a hardware GPIO pin either side of the dsb and isb instructions with a temporary modification to the source code, so you can measure the time taken with an oscilloscope on the GPIO pin? This would eliminate those instructions from your enquiries at least…
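Something along these lines, perhaps (a sketch only; the port and pin are placeholders, and the i.MX RT GPIO DR_SET/DR_CLEAR write-one registers are assumed to be available on this part):

```c
/* Sketch: bracket the barriers with a pin toggle so their duration
 * shows up directly on a scope.  GPIO1 pin 3 is a placeholder choice. */
GPIO1->DR_SET   = ( 1u << 3 );            /* drive the pin high */
__asm volatile( "dsb" ::: "memory" );     /* the instructions under test */
__asm volatile( "isb" );
GPIO1->DR_CLEAR = ( 1u << 3 );            /* drive the pin low  */
```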