rtel wrote on Thursday, September 04, 2014:
Hi - thank you for your feedback. You raise an issue that we have had quite some debate about, and I understand your comments better now, having read them again here, than I did originally in the bug report.
The issue is that a member of the TCB that is used in the stack overflow calculation has a high probability of itself having been overwritten by the time the check runs.
This issue was the reason the second method of stack overflow detection was added. The second method does not compare addresses, but compares what it finds at those addresses, so it has a better than 50/50 chance of working even if the values it uses in the test are themselves corrupt. The second method is not foolproof either, as it could in turn cause a crash if the pxCurrentTCB->pxStack value points to invalid memory. That risk is minimised because pxCurrentTCB->pxStack is only read, not written to, but the risk still exists. It is also much less efficient (but will always be more reliable, as it will catch most overflows that occur at any time, rather than just those present when the context is saved).
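To make the second method's idea concrete, here is a minimal, self-contained sketch. This is not the actual FreeRTOS code - the names STACK_FILL_BYTE, stack_fill and stack_pattern_intact are invented for illustration - but it shows the principle: fill the stack with a known pattern at creation time, then check whether the bytes at the stack limit still hold that pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fill byte and stack size, illustrative only; FreeRTOS
 * uses its own internal fill value. */
#define STACK_FILL_BYTE 0xA5U
#define STACK_WORDS     64U

/* Fill the whole stack area with the known pattern at creation time. */
static void stack_fill(uint8_t *stack_base, size_t bytes)
{
    memset(stack_base, STACK_FILL_BYTE, bytes);
}

/* Return 1 if the n bytes at the stack limit still hold the fill
 * pattern; a mismatch suggests the task wrote past its stack limit.
 * Note this only reads the memory - it never writes to it. */
static int stack_pattern_intact(const uint8_t *stack_limit, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (stack_limit[i] != STACK_FILL_BYTE) {
            return 0;
        }
    }
    return 1;
}
```

Because the check compares memory contents against a constant rather than trusting a stored address, a corrupted TCB does not automatically defeat it - though, as noted above, a bad pxStack pointer could still make the read itself fault.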
So then the question is what to do about this, and several options have been played with.
The first is, as you say, to place the TCB at the other end of the stack. Our thinking on that was that if a stack overflow did occur, and the TCB was at the other end of the stack, then the overflow would instead clobber the TCB of a task other than the task that actually overflowed its stack (assuming no queues or semaphores etc. existed between the two tasks) - and the general conclusion was that it was not a good idea to corrupt a task that was actually functioning fully without error. The result was that you effectively had two bad tasks rather than just one - which might be ok - but you then had to try and determine which task the corrupted TCB actually belonged to - whereas with the code as it is at the moment you already knew that, as pxCurrentTCB is pointing to it.
The second solution that was toyed with was to store the limit of the stack (the value that is needed to detect an overflow) at the start of the stack itself - so at least that value had a higher probability of not being corrupted. That has a couple of problems of its own though - the first is that additional RAM is required, and additional TCB members are required in order to locate the value; the second is that it is not a portable solution (each port layer would need updating individually), which is somewhat impractical. In any case it is not a foolproof solution, as that value will get written over if the corruption is severe enough to overwrite the entire TCB to get to the stack.
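A rough sketch of that second idea, assuming a downward-growing stack (the names guarded_stack_t, alloc_guarded_stack and stack_overflowed are invented, not FreeRTOS APIs): the first word of the stack allocation stores the stack's own limit, so the overflow check never has to trust a possibly corrupted TCB member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The first slot of the allocated region stores the stack's own limit,
 * so the overflow check reads it back from the stack region itself. */
typedef struct {
    uintptr_t *region;      /* start of the allocation                 */
    uintptr_t *usable_base; /* lowest address the task may legally use */
} guarded_stack_t;

static guarded_stack_t alloc_guarded_stack(size_t words)
{
    guarded_stack_t s;
    /* One extra word holds the limit; allocation assumed to succeed. */
    s.region = malloc((words + 1) * sizeof *s.region);
    s.usable_base = s.region + 1;
    s.region[0] = (uintptr_t)s.usable_base; /* limit stored in-stack */
    return s;
}

/* Assuming a downward-growing stack: a stack pointer below the stored
 * limit means the task has overflowed. */
static int stack_overflowed(const guarded_stack_t *s, const void *sp)
{
    return (uintptr_t)sp < s->region[0];
}
```

As the post notes, this only raises the probability of the limit value surviving - a large enough overrun from an adjacent region can still reach and overwrite it.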
A third solution was to store the TCBs somewhere completely separate from the stacks, but that has usability issues (it is more complex for the user).
Having stated the rationales given in the past, I’m now going to say that I don’t necessarily agree with them, for the following reasons…
Stack overflow detection is extremely useful when developing and debugging. But in a deployed system (in my opinion) it is folly to try and recover from a stack corruption in anything other than a safety critical system, because you know RAM has been corrupted, which means you cannot rely on any decisions the CPU makes as to how to recover from the situation (as highlighted in your post). Safety critical systems in turn, at least those using SafeRTOS, would normally protect the stack using an MPU (memory protection unit). The MPU allows you to detect a stack overflow before the corruption occurs, rather than after, meaning the TCB was not corrupted and, safe in the knowledge that the memory is still intact, an orderly recovery can be made.
So if, in a non-MPU system, it is not advisable to attempt recovery, and in an MPU system you are protected from such corruption anyway, why not move the TCB to a position such that a stack overflow in one task corrupts the TCB of another task - nothing is really lost, and the end user gets better detection and feedback when the detection indicates an error.
Finally, we are actively pursuing a program of optimisation at the moment, and one of the ideas there is to replace the two pvPortMalloc() calls with a single allocation, both when creating a task and when creating a queue (which includes semaphores) - this change could be made as part of that work.
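As a rough illustration of what a single allocation might look like (a hypothetical sketch, not the planned FreeRTOS implementation - tcb_t here is a tiny stand-in for the real, much larger TCB, and the names are invented): one malloc() covers both the stack and the TCB, with the TCB placed at the high end so that a downward-growing overflow in this task runs away from its own TCB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the task control block - illustrative only. */
typedef struct {
    uintptr_t top_of_stack;
    char name[16];
} tcb_t;

typedef struct {
    tcb_t   *tcb;   /* sits at the high end of the single allocation */
    uint8_t *stack; /* low end; a downward-growing stack overflows
                       away from this task's own TCB */
} task_alloc_t;

static task_alloc_t create_task_storage(size_t stack_bytes)
{
    task_alloc_t t;
    /* Round the stack size up so the TCB lands on an aligned address
     * (assumes 8-byte alignment is enough for tcb_t on this target). */
    stack_bytes = (stack_bytes + 7u) & ~(size_t)7u;
    uint8_t *block = malloc(stack_bytes + sizeof(tcb_t)); /* one malloc */
    t.stack = block;
    t.tcb = (tcb_t *)(void *)(block + stack_bytes);
    return t;
}
```

The whole task's storage is then released with a single free(t.stack), halving the allocate/free calls per task compared with two separate allocations.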
I should be able to re-open the bug report for you (or maybe a feature request?) - and will link to this forum post.
Regards.