Issue with uart RX data when sharing with multiple tasks

Context:

I’ve setup running two task with same priority sharing single UART to transmit and receive data on the RS484 bus.

On every data transmission from either of the task the response is expected, UART RX ISR has been setup to receive data.

Issue:

Occasionally, one of the task never gets response for one of transmit as a result eventually the xTaskNotifyWait times out.

what puzzling is the same code give me 100% success rate when device running with single task.

I would like to get my functions verified whether it is set right to handle uart data when sharing with multiple tasks

Any help would be appreciated! Thanks

Code snippets attached for reference:

/**
* @fn          DRV_RS485_Ret_t DRV_RS485_SendReceive(const uint8_t* const pTxData, uint16_t sizeTx, 
*                                                    uint8_t* const pRxData, uint16_t sizeRx, 
*                                                    uint16_t* const pReceivedBytes, uint16_t timeout)
*
*
* @brief       Sends a message over the RS485 network and receives the reply.
*              Timeout specifies the timeout in milliseconds for both the transmission and reception.
*              Function blocks the calling task until all the communication is performed (success or
*              or error).
*               
*
* @param[in]   pTxData: pointer to the transmit buffer;
* @param[in]   sizeTx: number of bytes to be transmitted;
* @param[in]   pRxData: pointer to the receive buffer;
* @param[in]   sizeRx: size of the receive buffer (maximum size of the received message);
* @param[out]  pReceivedBytes: pointer to the variable updated with the number of received bytes;
* @param[in]   timeout: timeout in milliseconds for the transmission and reception operation;
*
* @return      DRV_RS485_Ret_t.
*
* @note        Function uses the calling task events to communicate:
*               - RS485_EVENT_RX  (bit 1);
*               - RS485_EVENT_ERR (bit 0);
*
******************************************************************************/
DRV_RS485_Ret_t DRV_RS485_SendReceive(const uint8_t* const pTxData, uint16_t sizeTx, 
                                      uint8_t* const pRxData, uint16_t sizeRx, uint16_t* const pReceivedBytes,
                                      uint16_t timeout)
{
   assert_param(pTxData);
   assert_param(sizeTx);
   assert_param(pRxData);
   assert_param(sizeRx);
   assert_param(pReceivedBytes);
   
   DRV_RS485_Ret_t ret = DRV_RS485_NO_ERR;
   
   /* take the mutex */
   if (pdPASS == xSemaphoreTake(lpMutex, portMAX_DELAY))
   {
      /* Ensure the calling task does not already have a notification pending by calling
         xTaskNotifyWait() with the events RS485_EVENT_RX and RS485_EVENT_ERR cleared at the exit
         of the function and a block time of 0 (don't block). The current notification value is
         not required, so the pulNotificationValue parameter is set to NULL. */
      (void)xTaskNotifyWait(0, (uint32_t)(RS485_EVENT_RX | RS485_EVENT_ERR), NULL, 0);
    
      xTaskToNotify = xTaskGetCurrentTaskHandle();
      
      TickType_t tickDiff = xTaskGetTickCount() - oldTickCount;
      if (tickDiff < DRV_RS485_DELAY_MS)
      {
         vTaskDelay(DRV_RS485_DELAY_MS - tickDiff);
      }
      ret = DRV_RS485_SendEx(pTxData, sizeTx, pRxData, sizeRx, pReceivedBytes, timeout);
      
      
      /* release the mutex */
      if (pdPASS == xSemaphoreGive(lpMutex))
      {
         /* Mutex properly released */
      }
      else /* error during releasing the mutex: this should never happen ! */
      {
          ret = (DRV_RS485_NO_ERR == ret) ? DRV_RS485_ERR_MUTEX_GIVE : ret;
      }
   }
   else  /* error during taking the mutex: should never happen ! */
   {
      ret = DRV_RS485_ERR_MUTEX_TAKE;
   }
   return(ret);
}



/**
* @fn         static DRV_RS485_Ret_t DRV_RS485_SendEx(const uint8_t* const pTxData, uint16_t sizeTx, uint16_t timeout)
*             
* @brief      Sends the sizeTx data pointed by pTxData over the RS485 network.
*             timeout specifies the timeout in milliseconds.
*             CRC16 will be computed and appended at the end of the message                                 
*             
* @param[in]  pTxData: pointer to the transmit buffer.
* @param[in]  sizeTx: number of bytes to be transmitted.
* @param[in]  timeout: timeout in milliseconds.
*             
* @return     DRV_RS485_Ret_t.
*
******************************************************************************/
static DRV_RS485_Ret_t DRV_RS485_SendEx(const uint8_t* const pTxData, uint16_t sizeTx,
                                        uint8_t* const pRxData, uint16_t sizeRx,
                                        uint16_t* const pReceivedByte, 
                                        uint16_t timeout)
{
   DRV_RS485_Ret_t ret = DRV_RS485_NO_ERR;  
   
   bool errCond = (NULL == pTxData) || (NULL == pRxData) || (0 == sizeTx) || (0 == sizeRx) ||
                  (NULL == pReceivedByte);
   
   if (true == errCond)
   {
      ret = DRV_RS485_ERR_PARAMETER;
   }
   else
   {
      uint32_t event = 0;   
   
      if ((sizeTx >= (DRV_RS485_MSG_MIN_LEN - CRC16_SIZE)) && 
          (sizeTx <= (DRV_RS485_MSG_MAX_LEN - CRC16_SIZE)) )
      {
         memcpy (&lRs485.msg[0], pTxData, sizeTx);
         /* Compute the CRC and append it to the message */
         uint16_t crc = CRC16_Modbus_Compute(pTxData, sizeTx); 
         lRs485.msg[sizeTx]   = (uint8_t)(crc & BYTE_MASK);
         lRs485.msg[sizeTx+1] = (uint8_t)(crc >> EIGHT_VALUE);
         lRs485.bytesToSend = sizeTx + CRC16_SIZE;
         ret = DRV_RS485_Send(&lRs485.msg[0], lRs485.bytesToSend);
         
         if (ret != DRV_RS485_NO_ERR)   /* Transmission initialized properly ? */
         {
            /* Error in HAL layer: return the error code */
         }
         else
         {
            lRs485.bytesToReceive = sizeRx + CRC16_SIZE;
            ret = DRV_RS485_Receive(&lRs485.msg[0], lRs485.bytesToReceive);
            
            if (ret != DRV_RS485_NO_ERR)   /* Reception initialized properly ? */
            {
               /* Error in HAL layer: return the error code */
            }
            else
            {
               bool loop = true;
               while(loop) /* loop avoiding to exit due to the wrong event */
               {
                  /* Wait for transmission complete event or error */
                  BaseType_t xResult = xTaskNotifyWait(0, (uint32_t)(RS485_EVENT_ERR | RS485_EVENT_RX), &event, pdMS_TO_TICKS(timeout));
                  ret = DRV_RS485_HandleEvent(xResult, event); 
                  
                  switch (ret)
                  {
                     case DRV_RS485_ERR_TIMEOUT:
                        HAL_UART_AbortReceive(&huart2);
                        HAL_UART_AbortTransmit(&huart2);
                        loop = false;
                     break;
                     
                     case DRV_RS485_NO_ERR:
                        ret = CheckMsgIntegrity(&lRs485.msg[0]);
                        if (DRV_RS485_NO_ERR == ret ) /* Is msg CRC correct ? */
                        {
                           /* copy the message to the destination buffer without the CRC */
                           *pReceivedByte = lRs485.bytesReceived - CRC16_SIZE;
                           memcpy(pRxData, &lRs485.msg[0], *pReceivedByte);
                        }
                        loop = false;
                      break;
                     case DRV_RS485_ERR_UART_COM:
                        ret = (DRV_RS485_Ret_t)(DRV_RS485_ERR_HAL_BASE + LIB_BIN_CTZ(lRs485.errCode));
                     /* intentional fall-through */
                     case DRV_RS485_ERR_INTERNAL:
                        loop = false;
                     break;
                     
                     default:
                        /* nothing to do: just wait for the proper event */
                     break;
                  }
               }
            }
         }
      }      
      else /* Invalid length of the Tx Message */
      {
         ret = DRV_RS485_ERR_TX_LENGTH;
      }
   }
   return(ret);
}

Where is xTaskToNotify defined? Have you declared it volatile?

Can you add some counters to the ISR and the DRV_RS485_SendEx function to count the number of times ISR is fired and the number of times we send and receive notification. This will help us narrow down if the interrupt is not firing or the notification is not getting delivered to the right task.

You should write something like the following to the ISR -

static volatile uint32_t count_isr_fired = 0;
static volatile uint32_t count_task1_notified = 0;
static volatile uint32_t count_task2_notified = 0;

void ISR( void )
{
	count_isr_fired++;

	/* Perform work. */

	if( xTaskToNotify == first_task_handle )
	{
		count_task1_notified++;
		/* Call xTaskNotifyFromISR to notify first task. */
	}
	else if( xTaskToNotify == second_task_handle )
	{
		count_task2_notified++;
		/* Call xTaskNotifyFromISR to notify second task. */
	}
	else
	{
		/* Must never happen. */
		configASSERT( pdFALSE );
	}
}

Hi Gaurav, It is declared as static in the same file.

/** Task invoking the RS485 driver **/
static TaskHandle_t xTaskToNotify = NULL;

Change it to the following and do the things that I mentioned above -

/** Task invoking the RS485 driver **/
static volatile TaskHandle_t xTaskToNotify = NULL;
1 Like

I’ll do your recommendation, also thought of sharing my current ISR implementation as well

/**
* @fn         void HAL_UART2_RxCpltCallback(UART_HandleTypeDef* const huart);
*
* @brief      UART Rx Transfer completed callback for UART2.
*             The number of bytes received is computed.
*             The event RS485_EVENT_RX is generated.
*
* @param[in]  huart - pointer to the hal uart data structure
*
* @note       this function is weakly defined in HAL_Uart.h and here implemented
*
******************************************************************************/
void HAL_UART2_RxCpltCallback(UART_HandleTypeDef* const huart)
{
   (void)huart;
   assert_param(huart2.hdmarx);
   assert_param(huart2.hdmarx->Instance);
   
   /* Disable the Receive Timeout Interrupt */
   LL_USART_DisableRxTimeout(USART2);
   LL_USART_DisableIT_RTO(USART2);
   
   /* Compute the number of received bytes */
   lRs485.bytesReceived = lRs485.bytesToReceive - (uint16_t)huart2.hdmarx->Instance->CNDTR;
   oldTickCount = xTaskGetTickCount();

   BaseType_t xHigherPriorityTaskWoken = pdFALSE;
   
   xTaskNotifyFromISR(xTaskToNotify, (uint32_t)RS485_EVENT_RX, eSetBits, &xHigherPriorityTaskWoken);
   
   portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}

This is a bad design to begin with and most probably the root cause of your problem. I bet there is a race condition where the two tasks execute their respective communications in such an interleaved order that one notification gets lost.

A serial communication device by definition and name must have all of the communications over it strictly serialized, so the only way to make your approach work is to “atomize” the accesses, ie use a mutex to ensure that all protocol requirements execute uninterrupted by other tasks.

Since that requirement is not negotiable, you might as well use a single task to serve the device. That task typically acts as a message pump, meaning it receives asynchronous requests from other tasks in a queue and serves those in turn, being the only instance to access your serial port. All stably working drivers for serial devices and protocols I have ever seen in ~30 years function that way. All other architectures are causing little but grief and headaches.

3 Likes

Doing it with two tasks is a bad idea.
Simplify your code with a single thread that handles all ALL UART interaction. Have that thread (usually an interrupt handler) responsible for all UART interaction.
Control access to buffers from other threads with semaphores and locks.
-glen. (42 years of UART programming on embedded platforms)

1 Like

The OP already uses a mutex to serialize. That is why I was not sure why the notification would get lost. Though I agree with you that the code should be simplified to use only one task.

I noticed that the …Ex function checks on a timeout for the notification in its loop and terminates the loop when the timeout sets in, yet it may be the case that the isr asserts the notification right after and may thus get the control flow out of sync. A scope or tracealyzer would help here, also the code for the low level send (without the Ex suffix) appears to be missing.

Although this race condition does not look directly related to the multitasking observation, I personally would not bother to follow the analysis as long as there is reason to belive that a better (sequential single threaded) implementation of the driver does not show the issue.

As @aggarg suggested I’ve added a trace to isolate/narrow down a problem whether it is because of uart Interrupt/notification not getting triggered. It is very hard to reproduce the issue

low-level send and receive uart functions:

/**
* @fn         static DRV_RS485_Ret_t DRV_RS485_Send(uint8_t* const pData, uint16_t size)
*             
* @brief      Sends a message over the RS485 network (non blocking function)
*             
* @param[in]  pData: pointer to the message buffer 
*             
* @param[in]  size: size in bytes of the message
*             
* @return     DRV_RS485_Ret_t;
*             
*****************************************************************************/
static DRV_RS485_Ret_t DRV_RS485_Send(uint8_t* const pData, uint16_t size)
{
   DRV_RS485_Ret_t ret = DRV_RS485_NO_ERR;
   
   if ((NULL == pData) || (0 == size))
   {
      ret = DRV_RS485_ERR_PARAMETER;
   }
   else
   {
      /* Disable the Receive Timeout Interrupt */
      LL_USART_DisableRxTimeout(USART2);
      LL_USART_DisableIT_RTO(USART2);
      
      LL_USART_EnableDirectionTx(USART2);    /* Enable the USART Trasmitter */
      LL_USART_DisableDirectionRx(USART2);   /* Disable the USART Receiver  */
      
      HAL_StatusTypeDef res = HAL_UART_Transmit_DMA(&huart2, pData, size);
      if (res != HAL_OK)
      {
         ret = DRV_RS485_ERR_TX_HAL;
      }
      else
      {
         /* message will be sent by the DMA */
      }
   }
   return(ret);
}

/**
* @fn         static DRV_RS485_Ret_t DRV_RS485_Receive(const uint8_t* const pData, uint16_t size)
*             
* @brief      Receive a message over the RS485 network (non blocking function)
*             
* @param[in]  pData: pointer to the buffer where the receive message will be stored 
*             
* @param[in]  size: size in bytes of the buffer and maximum size of the message
*             
* @return     DRV_RS485_Ret_t;
*             
*****************************************************************************/
static DRV_RS485_Ret_t DRV_RS485_Receive(uint8_t* const pData, uint16_t size)
{
   DRV_RS485_Ret_t ret = DRV_RS485_NO_ERR;
   
   if ((NULL == pData) || (0 == size))
   {
      ret = DRV_RS485_ERR_PARAMETER;
   }
   else
   {
      LL_USART_EnableDirectionRx(USART2);    /* Enable the USART Receiver     */
         
      /* Configure the Receive Timeout Interrupt */
      LL_USART_SetRxTimeout(USART2, DRV_RS485_RECEIVE_TIMEOUT);
      LL_USART_EnableRxTimeout(USART2);
      LL_USART_EnableIT_RTO(USART2);
      
      /* Clear all the error flags */
      SET_BIT(USART2->ICR, UART_CLEAR_PEF | UART_CLEAR_FEF | UART_CLEAR_NEF | UART_CLEAR_OREF);   
         
      HAL_StatusTypeDef res = HAL_UART_Receive_DMA(&huart2, pData, size);
      if (res != HAL_OK)
      {
         ret = DRV_RS485_ERR_RX_HAL;
      }
      else
      {
         /* message will be received by the DMA */
      }
   }
   
   return(ret);
}

The issue has been struck after 4 hours of letting the device run in testing

So the counts look like this, it seems UART RX ISR is disturbing once in a while when multitasks are involved.

count_isr_fired = 83894;

count_task1_notified = 28419;
count_task1_TX_Events = 28419;

count_task2_notified = 55475;
count_task2_TX_Events = 55476;

Could it be possible for hardware interrupts to be blocked by Tasks?

How is this counted and why is it 1 more than ISRs have fired ?
The ISR notification counters are consistent (total/task1/2) as far as I can see, right ?
Do you handle a possible timeout correctly when waiting for a notification ?

Every transmission was also counted using count_task1/2_TX_Events
In this case, the last transmission from task 2 didn’t fire ISR. Breakpoint at the timeout event was hit when ISR was missed for the last event.

No, count_task2_notified is one less than count_task2_TX_Events, every transmission was expected response. whereas task1 counts are head to head

The notification timeout value was configured to 150ms and it is 10X more than the response rate. So this can be ruled out as the actual cause

Ok - I thought it’s the other way round counting the task notification events received from ISR.

One thing that might be worthwhile is adding a hardware monitor on the serial lines to see if the last message (or the one just before it) has a structural issue, or if a reply was given that was somehow lost.

Yes. The OP appears to be writing the master side of a bus communication with several slaves attached. One obvious reason for a communication to fail is a slave not recognizing its address on the bus and thus not responding as expected, which would hint at a problem on the transmit side. A scope with decoding abilities would help a lot here.

Important: Do not wire up your scope/LA to the Logical (MCU) side of the PHY but the physical bus, otherwise you may lose issues related to bus direction switching.

It could also help (the OP) you to add protocol level counting, ie determine how far in a packet reception you have come. Does the problem always show at the beginning of a packet or somewhere in between?

So the actual issue is at master device not firing the ISR. Clearly shows the slave response to the master command from the LA capture on the lines A,B RE/DE from RS485

the below was captured when master received data successfully, it wonders me even though the TX and RX data duration is less than the problematic case it was still received properly

So in either the case the master RE/DE pin was in Receive mode for enough time (>150us) before there was some data from the slave

How does your UART driver handle error conditions (overflow, frame and parity errors)?

It is handled with a error callback and task notification but none of those ever trigger when the packets were missing. So most likely no errors in the uart layer.

/**
* @fn         void HAL_UART2_ErrorCallback(UART_HandleTypeDef* const huart);
*             
* @brief      Error during trasmission or reception via DMA.
*             The error code is stored.
*             The event RS485_EVENT_ERR is generated.
*             
* @param[in]  huart - pointer to the hal uart data structure
*             
* @note       this function is weakly defined in HAL_Uart.h and here implemented
*
******************************************************************************/
void HAL_UART2_ErrorCallback(UART_HandleTypeDef* const huart)
{
   (void)huart;
   
   lRs485.errCode = huart2.ErrorCode;
   
   BaseType_t xHigherPriorityTaskWoken = pdFALSE;
   
   xTaskNotifyFromISR(xTaskToNotify, (uint32_t)RS485_EVENT_ERR, eSetBits, &xHigherPriorityTaskWoken);
   
   portYIELD_FROM_ISR( xHigherPriorityTaskWoken );   
}

Are you in any way influencing the behavior of the ISR (directly in your task/ISR code)? I’m curious is a race condition could exist where you may have to clear/disable/reset the receive ISR.