Thank you for your replies.
I looked into suggestions, but we stumbled upon the solution elsewhere.
Problem Explanation
By default, in the RT1176, the program’s code is stored in QSPI (external) FLASH memory. This FLASH has a clock speed of 133MHz. The connection between this FLASH and the processor is only 4 bits wide, so it takes 4 sequential accesses to the FLASH to retrieve one instruction (16 bits). The rate at which we can retrieve instructions from the FLASH is thus 33.25MHz. This is much slower than the (industrial) processor’s 792MHz clock. The processor thus heavily relies on ARM’s ‘performance-enhancing’ instruction ‘speculative accesses’ (explanatory link in previous reply) to reduce the effects of this bottleneck. This feature copies the processor’s ‘predicted’ instructions into its level 1 cache, so that they are immediately available if the processor requires them.
However, these ‘speculative accesses’ are inaccurate when interrupts occur, since the processor cannot predict when interrupts will occur. Therefore, whenever an ISR exits, “__DSB()” and/or “__ISB()” is called, and the processor’s instruction and data pipelines are flushed.
Solution
Code that is stored in the RT117’s SRAM_ITC is never copied into cache. Since it is ‘tightly-coupled’ to the processor (and can thus be accessed synchronously, with no delays), there is no benefit if its contents is cached. Consequently, the ARM processor does not apply its ‘performance enhancing’ speculative accessing techniques, or any other ‘optimisations’ on this code. Therefore, there are less/no instructions in the processor’s pipeline when returning from an ISR, so this delay is minimised.
Therefore, this problem is solved by storing as much of the ISR code as possible (preferably all) in the SRAM_ITC memory. This is in line with the recommendation found in the conclusion of NXP’s application note on the ARM Cortex-M7’s L1 cache.