Stack overflow check reliability

ilgitano wrote on Thursday, September 04, 2014:

Using IAR/MSP430/FreeRTOS7.6.2

When using configCHECK_FOR_STACK_OVERFLOW == 1 and heap_1 malloc code, stack overflow checks don’t catch stack overflows.

The reason is that the ordering of mallocs in prvAllocateTCBAndStack results in the tasks control block being immediately below the stack. On MSP430/ARM the stack grows down, so the first casualty of a stack overflow is the tasks control block, in particular trashing the pxStack/pxEndOfStack fields.

taskFIRST_CHECK_FOR_STACK_OVERFLOW cannot work, since it uses pxStack/pxEndOfStack in the test.

Additionally, since the TCB is trashed, when the stack overflow callback DOES get triggered, the task name is trash.

I’ve made some minor changes in my local code base to fix this - in prvAllocateTCBAndStack. I have submitted a bug (#93) which was closed-rejected. Perfectly reasonably as I just submitted my patch with a description of the fix, no description of the problem. I have been instructed to reopen the bug with a clearer explanation.

How do I reopen a bug? And does anyone think its worth it?

rtel wrote on Thursday, September 04, 2014:

Hi - thank you for your feedback. You raise an issue that we have had quite some debate about, and I understand your comments better now having read them again here rather than originally in the bug report.

The issue is that a member of the TCB that is used in the stack overflow calculation has a high probability of itself having been overwritten:

This issue was the reason the second method of stack overflow detection was added. The second method is not trying to compare addresses, but compare what it finds at that addresses, so has more than a 50/50 chance of working if the values it uses in the test are themselves corrupt. The second method is not fool proof though as it could in turn just cause a crash if the pxCurrentTCB->pxStack value points to invalid memory. That risk is minimised as pxCurrentTCB->pxStack is only read, not written to, but the risk still exists. It is also much less efficient (but will always be more reliable as it will catch most overflows that occur at any time, rather than just when the context is saved).

So then the question is what to do about this, and several options have been played with.

The first is, as you say, to place the TCB at the other end of the stack. Our thinking on that was that if a stack overflow did occur, and the TCB was at the other end of the tack, then the overflow will instead clobber the TCB of a task other than the task that actually overflowed its stack (assuming no queues or semaphores etc. existed between the two tasks) - and the general conclusion was that it was not a good idea to corrupt a task that was actually fully functioning without error. The result was you effectively had two tasks that were now bad rather than just one - which might be ok - but you had to try and determine which task is was that the corrupted TCB actually belonged to - where as with the code as it is at the moment you already new that as pxCurrentTCB is point to it.

The second solution that was toyed with was to store the limit of the tack (the value that is needed to detect an overflow) at the start of the stack itself - so at least that value had a higher probability of not being corrupted. That has a couple of problems in itself though - the first is additional RAM is required and additional TCB members are required in order to locate the value, and the second is it is not a portable solution (in that each port layer would need updating individually) and that is somewhat impractical. In any case it is not a fool proof solution as that value will get written over if the corruption is sever enough to overwrite the entire TCB to get to the stack.

A third solution was to actually store the TCB’s somewhere completely different to the stacks, but that has usability issues (more complex for the user).

Having stated the past given rationales, I’m now going to say that I don’t necessarily agree with them, for the following reasons…

Stack overflow detection is extremely useful when developing and debugging. But in a deployed system (in my opinion) it is folly to try and recover from a stack corruption in anything other than a safety critical system because you know RAM has been corrupted, which means you cannot rely upon any decisions the CPU makes as to how to recover from the situation (as highlighted in your post). Safety critical systems in turn, at least those using SafeRTOS, would normally protect the stack using an MPU (memory protection unit). The MPU allows you to detect a stack corruption before it occurs, rather than after it occurs, meaning the TCB was not corrupted, and safe in the knowledge that the memory is still in tact, allow an orderly recovery.

So if in a non MPU system it is not advisable to attempt recovery, and in an MPU system you are protected from such corruption anyway, why not move the TCB to a position such that a stack corruption in one task corrupts the TCB of another task - nothing is really lost and the end user gets better detection and feedback when the detection indicates an error.

Finally we are actively pursuing a program of optimisation at the moment and one of the ideas there is to replace the two pvPortMallocs() with a single malloc in both the cases of creating a task and creating a queue (which includes semaphores) and this could be done as part of that task.

I should be able to re-open the bug report for you (or maybe a feature request?) - and will link to this forum post.


ilgitano wrote on Thursday, September 04, 2014:

“Safety critical systems in turn, at least those using SafeRTOS, would normally protect the stack using an MPU”

I was wondering how SafeRTOS dealt with that.

For the rest of it, my thinking is the same as yours, so I’d like it pursued. I’d still call it a bug though, in that taskFIRST_CHECK_FOR_STACK_OVERFLOW cannot work reliably as implemented. But that’s just me being pedantic - if it’s been discussed to death already then I’m happy to call it a feature request.

heinbali01 wrote on Friday, September 05, 2014:

Hi Ilgitano,

It is fatal if TCB blocks get overwritten, but any stack overflow makes debugging difficult. Maybe the stack grows into a data buffer, and when writing to that buffer your stack might get overwritten.

And by the way, xTaskGenericCreate() also has a parameter “puxStackBuffer” which allows you to put the stack anywhere you like.

I’m just wondering, isn’t it possible to temporarily assign each task more than enough stack space and run the application while measuring the “maximum stack usage”?

( which can be measured by counting the “untouched bytes”, i.e. the number of bytes, counted from the bottom which still have the value ‘tskSTACK_FILL_BYTE’ )

In my firmwares, I made a PS command for this, which shows all tasks along with their actual stack usage in percentage and absolute numbers.

The only real safe way would be to use a memory protection unit (MPU) and get a page fault when a stack grows out of its boundary. But still then it is important to have a good estimate of the expected stack usage per task.


PS. this function returns the number of free words in the stack per task:

#if ( INCLUDE_uxTaskGetStackHighWaterMark == 1 )

  uxTaskGetStackHighWaterMark( TaskHandle_t xTask )

ilgitano wrote on Friday, September 05, 2014:

Hi Hein,

We’re using all of those. The issue is that we would like a more reliable system trap when the stack overflows. As RTE explained, this can’t be done perfectly. It can be made a lot more reliable by making sure the critical values needed for the test (in the task control block) are not the first things corrupted in an overflow, by not putting the TCB right at the hot end of the stack.