Task stops running without reason ?

ammaree wrote on Wednesday, January 13, 2016:

Using CC3200 with v8.2.3 with 7 user tasks. all running fine for a long time. Changed NTP from being a single function call at start to become a very simple independant task.

Task interval starts at 1min, incrementing every time it is run with an extra minute until the interval is 16 minutes. What I find is that the task stops running after a random period or number of runs, sometimes after 1, 5 or 10 events, the last time ran 95 times over about 22 hours before stopping.

All other tasks keep running with intervals ranging between 20mS and 60sec,

I have looked at the task status (2=Blocked) all looks fine

Every NTP event shows in syslog at the correct incrementing interval, right until when it stops.

No errors are recorded by the very simple task, anywhere along the execution path.
No sense in a breakpoint on the return from vTaskDelayUntil() sine it does not return

Any suggestions where to start looking?

rtel wrote on Wednesday, January 13, 2016:

There is always a reason, its just that sometimes we don’t know (at first!) what it is :o)

The only part of the software you are running that is provided by ourselves is the RTOS - so I’m afraid we cannot even begin to consider what might be going on in the NTP software. Comments will therefore necessarily be limited to trying to determine what the task is doing, rather than what the NTP is doing.

Have you read through the first page of the FAQ? http://www.freertos.org/FAQHelp.html In particular the points about setting interrupt priorities? Interrupt priorities are the normal source of such issues. Do you have configASSERT() defined, as that will catch most interrupt misconfiguration issues. http://www.freertos.org/a00110.html#configASSERT

If its not the above, then read on:

You imply that the task enters a call to vTaskDelay() but never leaves it again. If that is the case then we need to look deep into the internal structures so we can work out why that might be. It could be that another task has corrupted something, or that the task in question does not have the priority of any interrupts it is relying on set correctly. Let me know if this is the case and I can assist further.

Are you sure it is inside the vTaskDelay() function? If not, then read on again:

Some of the demos we provide ourselves using our own stack (http://www.freertos.org/tcp) run multiple tasks that make use of the TCP/IP stack. We have in the past seen situations where one task will apparently stop, while all the other tasks that are using the TCP/IP stack continue functioning correctly. When debugging this we have found that the faulty task has got stuck in driver code (not written by us I hasten to add ;o) that is performing a polling operation without a proper time out. Even when the poll operation is using a blocking call, so does not consume all the CPU time, it just effectively deadlocks. Is your task using a driver (maybe a clock, or something similar) that is not used by any other task?

ammaree wrote on Wednesday, January 13, 2016:

configASSERT is enabled and working, have tested by changing some priorities to trigger.
definately not returning from vTaskDelay(), set breakpoints to check.
The task runs perfectly, updates time and returns with no error. All socketes are closed at the end of the update, the task run counter is update,an NTP update message message is sent, then the task goes to sleep again.

If any error occurs, the task error counter is incremented and the error time, code, module and line is logged to syslog.

The example when it ran 95 times the sleep periods (after the 1st run) were:
1->2->3->4->5->6->7->8->9->10->11->12->13->14->15 then 79x 16sec over a ~22hr period.

When I checked the task run counter was 95, the task error counter 0 and the syslog for this device showed no errors after the last successful NTP entry.

Problem is it happens very randomly, so the question is when it happens, where do I start looking…

rtel wrote on Thursday, January 14, 2016:

You said in your first post that the task in question was reported as being in the Blocked state, so the first thing we need to do is determine what time the task is expecting to unblock. This is a little fiddly.

  1. When you create the task save the task’s handle in a variable that can be viewed in the debugger. Lets assume the variable is called MyTask.

  2. Run the code for a short while, then pause the debugger.

  3. Inspect the MyTask variable in whichever window your debugger provides for this (watch/symbols/expressions/whatever). You want to see the variable as a structure - if the debugger instead shows it as a void pointer with a single address, then cast the variable to a tskTCB pointer, as shown in the attached image.

  4. Expand the structure in the debugger, you are looking for the xGenericListItem->xItemValue member, again as shown in the image attached (the red number). Between calling vTaskDelayUntil and exiting the vTaskDelayUntil function the value highlighted is the tick time at which the task should leave the blocked state.

  5. Once you are confident you can inspect the structure successfully, start the debugger running again, and wait for the error to occur. Once the error has occurred again pause the debugger and report back both the following values:

  • MyTask->xGenericListItem.xItemValue
  • xTickCount
  • MyTask->xGenericListItem.pvContainer
  • pxDelayedTaskList
  • pxOverflowDelayedTaskList

ammaree wrote on Thursday, January 14, 2016:

Apologies, but cannot find the attached image…
What I have done is below…

void	vNtpTask(void *pvParameters) {
_i32	iRetVal ;
TaskHandle_t *	pMyTask ;
	pMyTask = TaskInfo[TASKIDX_NTP].TaskHandle ;
	f64MinMaxAvgInit(&mmaNtpRatio) ;
	TaskInfo[TASKIDX_NTP].LastWakeTime	= xTaskGetTickCount();    // Get the current time as a reference to start our delays.
    TaskInfo[TASKIDX_NTP].Interval		= 0 ;
	while (true) {
	// first wait till IP is up and running
		vIpStackWaitTillRunning() ;
		if ((iRetVal = xNtpGetTime(&CurSecs)) >= 0) {
			xReportEvent(EVENT_TIME_UPDATE_NTP) ;
			TaskInfo[TASKIDX_NTP].TaskCount++ ;
			if (TaskInfo[TASKIDX_NTP].Interval < NTP_TASK_PERIOD_MS_MAX) {
				TaskInfo[TASKIDX_NTP].Interval += NTP_TASK_PERIOD_MS_MIN ;
			} else {
				TaskInfo[TASKIDX_NTP].Interval = NTP_TASK_PERIOD_MS_MAX ;
			}
		} else {
			LOG_ERROR(iRetVal) ;
			xErrorHandlerAlert() ;
			TaskInfo[TASKIDX_NTP].Interval = NTP_TASK_PERIOD_MS_MIN ;
		}
		vTaskDelayUntil(&TaskInfo[TASKIDX_NTP].LastWakeTime, TaskInfo[TASKIDX_NTP].Interval) ;
	}
}
Using CCS 6.1.2.14 but cannot do the cast to (tskTCB *)

Tried with pMyTask as both local and global variable.
Get "Error: cannot load from non-primitive location"

rtel wrote on Thursday, January 14, 2016:

That is because I forgot to attach it :o) It is attached to this post.

ammaree wrote on Thursday, January 14, 2016:

Got it… results attached below

rtel wrote on Thursday, January 14, 2016:

Looks good - now we just need the number after the issue has occurred.
The idea is to see when the task thinks it should be waking up, compared
to the current time.

ammaree wrote on Thursday, January 14, 2016:

Sorry hold a bit, need to get the xTickCount and the 2 task lists as well.
How shall I capture the 2 tasks lists, what level of detail…

rtel wrote on Thursday, January 14, 2016:

Don’t forget to expand the xGenericTaskList structure in the debugger, so I can see the values of its members.

The pxDelayedTaskList and pxOverflowDelayedTaskList values do not need to be expanded. I just want to know their values - that is - what they are pointing to.

Some explanation:

When the task blocks the xItemValue member is set to the time at which the task needs to unblock, and the task is placed into a delayed task list. There are two delayed task lists, one for wake times that have not wrapped around, and one for wake times that have wrapped around.

The information you are going to provide will tell me the time at which the task should be unblocked, and the current time. To complete the picture I then need to know if the time at which the task should unblock has wrapped from the current time, or not - that information comes from MyTask->xGenericListItem.pvContainer value - which shows which delayed list the task is in.

ammaree wrote on Thursday, January 14, 2016:

Here we go…

rtel wrote on Thursday, January 14, 2016:

Unfortunately you have expanded the xEventListItem rather than the
xGenericListItem. Could you take another snapshot please.

ammaree wrote on Thursday, January 14, 2016:

Apologies…

rtel wrote on Thursday, January 14, 2016:

What this tells us:

  • The task should unblock when the tick count is 180838.
  • The current tick count is 1099648.
  • The that is in the overflow delayed list.

The fact that the unblock time is less than the current tick count means
the task should be in the overflow delayed list - which indeed it is.

As it is, the task will unblock after the tick count has reached
0xffffffff (assuming a 32-bit tick count), wrapped back to 0, and then
counted up to 180838 - which will be a very long time into the future
(assuming a reasonable tick frequency).

So at the moment it appears as if, when the task entered the blocked
state, it asked for an extraordinarily long block time. Could this be
an arithmetic overflow problem in your code - by which I mean, are you
holding times in signed variables instead of unsigned variables? The
TickType_t type should be used (which in your case is presumably
equivalent to uint32_t).

I presume you are confident the tick rate is running at the expected
value, because you have been observing the time at which the NTP actions
are occurring, but what is the configTICK_RATE_HZ value?

ammaree wrote on Thursday, January 14, 2016:

Just for info, in this instance the task only ran twice, once at startup and another time after 1 minute.

The task was supposed to wake up and run after another 2 minutes, ie after a total of 3 minutes or ~ 180k ticks

ammaree wrote on Thursday, January 14, 2016:

Tick rate is running at 1000Hz, I use the tick time to update both UpTime (sec and msec from boot) and CurSecs (Unix epoch sec and msec) values.
Both Uptime and CurSecs run VERY close to actual NTP time, checked regularly.
Both times are used in a rules engine that performs cloud controlled tasks that execute at 10mS accuracy.

The 2 macros used to control the (escalating) time delay are below:

#define 	NTP_TASK_PERIOD_MS_MIN					(1 * 60 * 1000)
#define 	NTP_TASK_PERIOD_MS_MAX					(16 * 60 * 1000)

Both are stored in a 24 bit wide _u32 value, structure below

typedef	struct task_info {
	TaskHandle_t		TaskHandle ;
	QueueHandle_t		QueueHandle ;
	TimerHandle_t		TimerHandle ;
	TickType_t			LastWakeTime ;
	volatile	_u32	TaskCount ;
	volatile	_u32	QueueCount ;
	volatile	_u32	TimerCount ;
	volatile	_u32	Interval 	: 24 ;
	volatile	_u16	TaskError	: 12 ;
	volatile	_u16	QueueError	: 12 ;
	volatile	_u16	TimerError	: 12 ;
	volatile	_u8		TaskState	: 4 ;
} TaskInfo_t ;

The value in my structure is very clearly still 2min.

33m19s318: --Name--|F#|Pc/Pb| S |LoStk|%Util|TimerTicks|TaCount|QuCount|TiCount|TaErr|QuErr|TiErr|Ph|Intvl|
33m19s319:      MAC| 1| 7/ 7| 2 |  260| 0.00|       262|    211|    211|      0|    0|    0|    0| 0| 0
33m19s320:      Con| 2| 1/ 1| 1 |  176| 0.01|      2740|   9993|      5|      0|    0|    0|    0| 0| 0s000
33m19s321:      IP4| 3| 5/ 5| 2 |  440| 0.03|      7837|   2000|      0|      0|    0|    0|    0| 5| 1
33m19s321:      CRx| 4| 3/ 3| 2 |  216| 0.00|       297|      9|    662|      0|    0|    0|    0| 0| 1
33m19s322:      CTx| 5| 2/ 2| 2 |  432| 0.00|       681|    662|    662|   1997|    0|    0|    0| 0| 1
33m19s323:      Eng| 6| 6/ 6| 2 |  936| 0.00|      1175|   2001|      0| 199931|    0|    0|    0| 0| 1
33m19s324:      Sen| 7| 7/ 7| 2 |  472| 0.12|     24385| 195608|      0|      0|    0|    0|    0| 0| 0s000
***33m19s325:      NTP| 8| 1/ 1| 2 |  272| 0.00|        89|      2|      0|      0|    0|    0|    0| 0| 2m00***
33m19s326:     IDLE| 9| 0/ 0| 1 |  396|99.35|  19829905|
33m19s326:  Tmr Svc|10| 2/ 2| 2 |  544| 0.45|     90524|

Will check the delay calculations but it is very simple and has in some cases run for almost 24 hours before skipping a slot.

rtel wrote on Thursday, January 14, 2016:

I should have asked at the beginning - what do you have configCHECK_FOR_STACK_OVERFLOW set to? Also, how big is the interrupt stack?

[configCHECK_FOR_STACK_OVERFLOW only checks the stack used by tasks, but the stack in use switches to the interrupt stack each time an interrupt is entered. The stack used by interrupts is the same stack as used by main() before the scheduler is started, so it is set up by your tools somewhere)]


From your last post I can infer the tick rate is 1KHz.

Interesting. So we are expecting xItemValue to be approximately 180K, and indeed it is, so appears to have been calculated correctly. Why then, is it in the overflow list? Hmm…thinking. Something is very odd - the code that determines which list to put the task into is very simple:

	if( xTimeToWake < xTickCount )
	{
		/* Wake time has overflowed.  Place this item in the overflow list. */
		vListInsert( pxOverflowDelayedTaskList, &( pxCurrentTCB->xGenericListItem ) );
	}
	else
	{
		/* The wake time has not overflowed, so the current block list is used. */
		vListInsert( pxDelayedTaskList, &( pxCurrentTCB->xGenericListItem ) );

…and we can already see that xTimeToWake was calculated correctly, as it is shown in the screen shot.

Normally at this point I would ask for a cut down project, with just enough code to demonstrate the issue, so I could run it here - but in this case I don’t think that is going to be possible as you are using the NTP code.

How about this - could you update your task so the task does not call any of the NTP code, but instead just increments a volatile variable every time it runs. So you keep the structure of the task the same, the bit that calls vTaskDelayUntil() and the bit that calculates the next wake time should remain unchanged, but the bit that calls the NTP code is commented out and replaced with a simple variable increment. Then run the code again. If the problem still exists then we would have something I could try running here.

rtel wrote on Thursday, January 14, 2016:

Our two last posts crossed over each other. Now replying to your post that states the NTP_TASK_PERIOD_MS_MIN value.

Your NTP_TASK_PERIOD_MS_MIN and NTP_TASK_PERIOD_MS_MAX values are defined as signed integers - a long shot - but could cause a problem. Can you add ‘ul’ to the end literal, as (1UL * 60UL * 1000UL)

The 24-bits is also to be of suspicion.

ammaree wrote on Thursday, January 14, 2016:

Have already added UL to all 3 components, made no difference. Falied after 2 executions.

#define 	NTP_TASK_PERIOD_MS_MIN					(1UL * 60UL * 1000UL)
#define 	NTP_TASK_PERIOD_MS_MAX					(16UL * 60UL * 1000UL)

24 bits should handle ~280min, so we should be very safe…

ammaree wrote on Friday, January 15, 2016:

thanks for the help, no bug anywhere but rather an obscure NTP issue. Some “holier than thou” servers reserve the right to simply ignore a request, no warning. When using blocking socket option the cc3200 IP stack block with the rtos waiting for a response.

Just removing these specific hosts from the list removed the problem but I will change the code to use nonblocking socket and handle the EAGAIN and EWOULDBLOCK returns.