Recent change in FreeRTOS+FAT appears to have left me with an intermittent bug

pete-pjb · May 17, 2021, 4:07pm

Hi, I have been working with the latest master of FreeRTOS+FAT, and the most recent updates (around 5 weeks ago, I believe) may have introduced a problem. I have been getting intermittent asserts since this update and cannot yet see what the issue is. It only occurs during a read from or a write to my SD card. Are there any pointers as to where I should look to find out what is going wrong, please?

The Assert:
Assert failed in file E:/Git/Lab-Project-FreeRTOS-FAT/ff_locking.c, line 280

I am working with the Xilinx Zynq platform. From the master branches on GitHub I am using FreeRTOS, +TCP, +FAT and EmbedTLS. Up until this update I had not had any issues with my project.

Many thanks,

Kind Regards,

Pete

htibosch · May 17, 2021, 7:26pm

Thanks for reporting this.

You refer to +FAT PR #14, in which:

void FF_UnlockFAT( FF_IOManager_t *pxIOManager )
{
    if( xTaskGetSchedulerState() != taskSCHEDULER_RUNNING )
    {
        /* Scheduler not yet active. */
        return;
    }
    configASSERT( ( xEventGroupGetBits( pxIOManager->xEventGroup ) & FF_FAT_LOCK_EVENT_BITS ) == 0 );
    pxIOManager->pvFATLockHandle = NULL;
    xEventGroupSetBits( pxIOManager->xEventGroup, FF_FAT_LOCK_EVENT_BITS );
}

The patch works well for me, also on a Xilinx Zynq platform.
It would mean that FF_UnlockFAT() is called without a preceding call to FF_LockFAT().
Could you check the call stack at the moment the assert fires?
I am curious which function has called FF_UnlockFAT().
Thanks,
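
To make the assert concrete, here is a minimal host-side model of the invariant that the configASSERT() in the quoted FF_UnlockFAT() checks. Reading the snippet, the event bit appears to mean "lock is free": unlocking sets it, so it must be clear while the lock is held. This is only a sketch under that reading, not the +FAT source. ModelIOManager, model_init(), model_lock_fat(), model_unlock_fat() and MODEL_FAT_LOCK_BIT are hypothetical names, and the model returns a status where the real code would block or assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FF_FAT_LOCK_EVENT_BITS. */
#define MODEL_FAT_LOCK_BIT 0x01u

typedef struct {
    uint32_t event_bits; /* bit set => lock is free, bit clear => lock is held */
    const void *owner;   /* stand-in for pvFATLockHandle */
} ModelIOManager;

static void model_init(ModelIOManager *io)
{
    assert(io != NULL);
    io->event_bits = MODEL_FAT_LOCK_BIT; /* the lock starts out free */
    io->owner = NULL;
}

/* Non-blocking stand-in for taking the lock.
 * Returns false where a real implementation would wait for the bit. */
static bool model_lock_fat(ModelIOManager *io, const void *task)
{
    assert(io != NULL);
    if ((io->event_bits & MODEL_FAT_LOCK_BIT) == 0u) {
        return false; /* already held by someone */
    }
    io->event_bits &= ~MODEL_FAT_LOCK_BIT;
    io->owner = task;
    return true;
}

/* Returns false in exactly the situation where the quoted
 * configASSERT() would fire: the bit is already set, so nobody
 * holds the lock that is being released. */
static bool model_unlock_fat(ModelIOManager *io)
{
    assert(io != NULL);
    if ((io->event_bits & MODEL_FAT_LOCK_BIT) != 0u) {
        return false; /* unlock without a matching lock */
    }
    io->owner = NULL;
    io->event_bits |= MODEL_FAT_LOCK_BIT;
    return true;
}
```

In this model a second model_unlock_fat() in a row fails, and so does an unlock after the bit has been set by anything other than the unlock path. That second case matters here: a stray write into the event group's memory would produce the same symptom as a genuinely unbalanced call.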

pete-pjb · May 18, 2021, 8:07am

Hi Hein,

Thank you very much for responding to me.

The call stack is as follows:

ARM Cortex-A9 MPCore #0 (Suspended)
0x00100a38 vAssertCalled(): cpu.c, line 135
0x0014156c FF_UnlockFAT(): ff_locking.c, line 280
0x0013c038 FF_GetChainLength(): ff_fat.c, line 1192
0x00260100 FF_InitEntryFetch(): ff_dir.c, line 978
0x0025f480 FF_FindEntryInDir(): ff_dir.c, line 337
0x00263020 FF_Open(): ff_file.c, line 298
0x0024c6bc ff_fopen(): ff_stdio.c, line 154
0x00211bd4 get_tles_from_internet_task(): …/src/tle/get_tle.c, line 113
0x00000000
…

This is just one example, I will send more when I can catch another instance.

Kind Regards,

Pete

pete-pjb · May 18, 2021, 9:18am

Hi Hein,

Here is another call stack from a failure at a different point in my application:

ARM Cortex-A9 MPCore #0 (Suspended)
0x001019b0 vAssertCalled(): cpu.c, line 135
0x001f871c FF_Assert_Lock(): ff_locking.c, line 232
0x001f237c FF_getFATEntry(): ff_fat.c, line 391
0x00243c44 FF_GetSequentialClusters(): ff_file.c, line 1123
0x0024432c FF_WriteClusters(): ff_file.c, line 1368
0x002456a0 FF_Write(): ff_file.c, line 2168
0x001f8cdc ff_fwrite(): ff_stdio.c, line 379
0x0013835c execute_cmd(): …/src/ftpd/otgftpd.c, line 721
0x00136440 parse_ftp_ctl(): …/src/ftpd/otgftpd.c, line 199
0x00136164 ftp_connection(): …/src/ftpd/otgftpd.c, line 119
0x00000000
…

I haven’t ruled out my application having some kind of memory corruption that is causing this issue, but right now I have to start somewhere. It could just be coincidence that the problem manifested itself around the time of the +FAT update. There are two separate Zynq devices which communicate with one another using +TCP with EmbedTLS. I am also occasionally getting a TLS transfer failure, and the error suggests that this too might be suffering from some kind of memory corruption. The application is pretty large and includes a GUI and a VGA controller which I have built using the fabric of the Zynq. Both devices run a variant of the same code base and both are seeing the problem. I have checked all my task stacks etc. and all seems well. I will continue to study all my code in the meantime to see if I have botched a pointer somewhere. If you feel the +FAT code looks good, no problem; I just wanted to check whether I am alone with this issue, in which case it’s very likely I have an issue in my application code somewhere. As you know, these types of problems are a bit of a nightmare to locate and resolve, so I just need to keep ruling out all possibilities.

Kind Regards,

Pete

htibosch · May 20, 2021, 9:59am

There are two possible sources of corruption for a Xilinx Zynq:

The use of FPU registers by functions like memcpy() and memset(). By default, the FPU registers are not stored on the task stack.
The use of DMA with cached memory: make sure that the contents of the memory are flushed before DMA-writing, and refreshed (invalidated) after DMA-reading.
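
The second point can be illustrated with a small host-side model of a write-back cache in front of RAM. It is only a sketch: ModelMemory and all of the functions below are hypothetical, and a single cache "line" covers the whole buffer. On the Zynq BSP the two maintenance steps would typically map to Xil_DCacheFlushRange() and Xil_DCacheInvalidateRange() from xil_cache.h, but the model does not call them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 8u

typedef struct {
    uint8_t ram[BUF_SIZE];   /* what a DMA engine sees */
    uint8_t cache[BUF_SIZE]; /* what the CPU sees while the line is valid */
    bool valid;
    bool dirty;
} ModelMemory;

/* CPU store: lands in the cache only (write-back behaviour). */
static void cpu_write(ModelMemory *m, const uint8_t *src)
{
    assert(m != NULL && src != NULL);
    memcpy(m->cache, src, BUF_SIZE);
    m->valid = true;
    m->dirty = true;
}

/* CPU load: served from the cache; RAM is only read on a miss. */
static void cpu_read(ModelMemory *m, uint8_t *dst)
{
    if (!m->valid) {
        memcpy(m->cache, m->ram, BUF_SIZE);
        m->valid = true;
        m->dirty = false;
    }
    memcpy(dst, m->cache, BUF_SIZE);
}

/* Flush: write dirty cache contents back to RAM. */
static void cache_flush(ModelMemory *m)
{
    if (m->valid && m->dirty) {
        memcpy(m->ram, m->cache, BUF_SIZE);
        m->dirty = false;
    }
}

/* Invalidate: drop the cached copy so the next load re-reads RAM. */
static void cache_invalidate(ModelMemory *m)
{
    m->valid = false;
    m->dirty = false;
}

/* DMA towards the device (for example writing a sector to the card). */
static void dma_from_ram(const ModelMemory *m, uint8_t *device)
{
    memcpy(device, m->ram, BUF_SIZE);
}

/* DMA from the device into RAM (for example reading a sector). */
static void dma_to_ram(ModelMemory *m, const uint8_t *device)
{
    memcpy(m->ram, device, BUF_SIZE);
}
```

Without cache_flush() before dma_from_ram(), the device receives whatever was in RAM before the CPU wrote. Without cache_invalidate() after dma_to_ram(), cpu_read() keeps returning the old cached bytes. Both failures are silent, which is why they tend to show up later as unrelated-looking asserts.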

When you have solved the corruption problem, and if the +FAT assert is still there, it will be good to make a trace of all calls to FF_LockFAT() and FF_UnlockFAT().
But as I said, I cannot replicate the problem when running 2 FTP sessions in parallel.
Thanks,
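
A trace of lock and unlock calls could look roughly like the ring buffer below. This is a hedged sketch rather than anything taken from +FAT: LockTrace, trace_record(), trace_balanced() and the two macros are hypothetical, and on a target the recording would also need to be protected against concurrent tasks, for example with taskENTER_CRITICAL() and taskEXIT_CRITICAL().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 16u /* number of most recent events that are kept */

typedef enum { TRACE_LOCK, TRACE_UNLOCK } TraceKind;

typedef struct {
    TraceKind kind;
    const char *caller; /* name of the function that made the call */
    int line;           /* source line of the call site */
} TraceEntry;

typedef struct {
    TraceEntry entries[TRACE_DEPTH];
    size_t next; /* total number of events recorded so far */
    int depth;   /* locks minus unlocks */
} LockTrace;

/* Store one event, overwriting the oldest entry once the buffer is full. */
static void trace_record(LockTrace *t, TraceKind kind,
                         const char *caller, int line)
{
    assert(t != NULL);
    TraceEntry *e = &t->entries[t->next % TRACE_DEPTH];
    e->kind = kind;
    e->caller = caller;
    e->line = line;
    t->next++;
    t->depth += (kind == TRACE_LOCK) ? 1 : -1;
}

/* For a non-recursive lock the depth should only ever be 0 or 1. */
static int trace_balanced(const LockTrace *t)
{
    return (t->depth == 0) || (t->depth == 1);
}

/* Convenience wrappers that capture the call site automatically. */
#define TRACE_LOCK_HERE(t)   trace_record((t), TRACE_LOCK, __func__, __LINE__)
#define TRACE_UNLOCK_HERE(t) trace_record((t), TRACE_UNLOCK, __func__, __LINE__)
```

Checking trace_balanced() just before the assert and dumping the last few entries would show whether an unlock really arrived without a lock, or whether the recorded calls are balanced and the lock state was changed by something else.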

pete-pjb · May 20, 2021, 10:37am

Hi Hein,

Thanks for your reply.

With regard to the FPU situation, I have:

#define configUSE_TASK_FPU_SUPPORT 2

in my FreeRTOSConfig.h file.

I am using a lot of floating point maths, along with memcpy() and memset(), so it was necessary to use the available floating point hardware for this application. When I started the project a long time ago, I remember this setting being the necessary software-side requirement for using the Zynq floating point hardware. Can you please confirm that this is the case and that I have not missed something here? I think if I had this wrong I would not have gotten this far into the project (3 years!) before I had a problem.
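
For reference, the two values of this setting differ as sketched in the fragment below. This reflects the behaviour of the FreeRTOS Cortex-A9 GCC port as far as its documentation describes it, so it is worth checking against the port version actually in use.

```c
/* FreeRTOSConfig.h fragment, Cortex-A9 GCC port.
 *
 * 1: a task only gets an FPU context after it has called
 *    vPortTaskUsesFPU(); tasks that never call it must not
 *    execute floating point (or NEON/VFP based) instructions.
 * 2: every task is created with an FPU context, so the FPU
 *    registers are saved and restored for all tasks.
 */
#define configUSE_TASK_FPU_SUPPORT 2
```

With the value 2, library routines such as memcpy() and memset() that may use FPU registers internally are covered for every task, at the cost of a larger context per task.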

With regard to the cached memory, I will double check this. I also have quite a lot of un-cached space configured for various hardware buffers etc. I have a number of DMA peripherals running in the FPGA fabric which I have designed myself. I will go over all of it again and see if I can spot anything which is not correct.

I am using the FreeRTOS heap4 implementation for heap memory management, i.e. all of my code uses the FreeRTOS pvPortMalloc() function. However, I am also using some C library functions which rely on malloc() and free(), so I have a reasonable heap allocated in my linker files and I have implemented the following functions to make everything thread safe:

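/* Called by newlib before it touches its heap: suspend the scheduler
   so that no other task can run while the heap is being modified. */
void __malloc_lock(struct _reent *REENT) {
    vTaskSuspendAll();
}

/* Called by newlib when it is done with the heap: resume the scheduler. */
void __malloc_unlock(struct _reent *REENT) {
    xTaskResumeAll();
}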

Could you or @rtel perhaps confirm this is sufficient please? (Again I can’t imagine this is not correct as I am sure I wouldn’t have gotten 3 years into the project before I had any problems!)

Thank you for the pointers. (Pardon the pun!)

In the meantime I will continue to look over my code to see if I can find what is causing these issues.

Kind Regards,

Pete

hs2 · May 20, 2021, 12:19pm

I can confirm that your malloc lock implementation works.
I’m using the same when using the libc/newlib heap.

pete-pjb · May 20, 2021, 12:20pm

Many thanks Hartmut.

Kind Regards,

Pete

htibosch · May 25, 2021, 4:02pm

Pete, any news on this?
If you want, I could prepare a version that makes specific logging about the lock and unlock events, and you could test with that?

pete-pjb · May 26, 2021, 8:35am

Hi Hein,

I’m sorry, we have been away due to a family bereavement.

If this is not too much trouble for you, it would be very much appreciated!

Kind Regards,

Pete

carlk3 · May 31, 2021, 4:49pm

Do you have

#define configUSE_NEWLIB_REENTRANT 1

in FreeRTOSConfig.h?

pete-pjb · June 1, 2021, 7:19am

Hi Carl,

Yes I do, thanks.

Kind Regards,

Pete

RAc · June 1, 2021, 7:50am

Hi there Pete,

just a shot into the dark:

Have you tried different SD cards? Corrupted SD cards may manifest themselves in infinite ways, and the quality of SD cards on the market covers a very broad range from unusable trash to industrial quality. Even with the latter, it is possible that the card has approached the end of its life with respect to write cycles.

pete-pjb · June 1, 2021, 8:19am

Hi RAc,

Many thanks for your input. Yes, I have tried different cards. I use SanDisk 8GB SDHC cards and these have proven reliable over a number of projects with different micros, so hopefully we can rule those out. I really suspect I have a memory corruption issue which is likely not the responsibility of FreeRTOS+FAT, but the code for the project is pretty large now and tracking down such issues, as you probably know, is not easy.

Kind Regards,

Pete

RAc · June 1, 2021, 8:25am

Ok, sorry for the noise then!

Can you positively confirm that the errors are related to the new middleware (i.e. roll back the FAT layer to the last known working release and verify that, even with the application code changes, the problems no longer show)? Typically this will be hard to impossible, but maybe this is an easy test for you? Just to rule out one of those hardware related things that are bound to drive us bonkers at times…

pete-pjb · June 1, 2021, 10:43am

Hi RAc,

Don’t apologise for making suggestions, all are welcome!

I have attempted a rollback, but there have been tool updates since I got to where I am now and they seem to be breaking the older version of the project, which makes it very difficult to do this…

Anyway, I will continue to investigate the issue. As I mentioned before, I suspect it will be down to something I have done; I still think I have probably corrupted a pointer somewhere… I only posted here on the off chance that someone else may have seen an issue with the +FAT update…

Kind Regards,

Pete

carlk3 · June 1, 2021, 2:59pm

Just a shot in the dark, but, it’s easy to swap Dave Nadler’s heap_useNewlib for heap4 (see newlib and FreeRTOS) to try a different heap memory management scheme. If the behavior remains the same, at least you can more or less rule out newlib/heap locking issues.

(I guess you’ve already tried basic things like #define configCHECK_FOR_STACK_OVERFLOW 2.)
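
As a reminder of what that setting can and cannot catch, the idea behind overflow check method 2 can be modelled on the host as below. It is a simplified sketch: FreeRTOS fills each task stack with 0xA5 bytes and, with method 2, verifies at context switch time that the far end of the stack still holds that pattern. The function names are hypothetical and GUARD_BYTES is an assumption about the size of the checked region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE   0xA5u /* fill pattern used by FreeRTOS for task stacks */
#define GUARD_BYTES 16u   /* assumption: size of the region that is checked */

/* Done once when the task is created. */
static void stack_fill(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, FILL_BYTE, size);
}

/* The stack grows downward, so the guard region sits at the lowest
 * addresses. Returns false once any guard byte has been overwritten. */
static bool stack_guard_intact(const uint8_t *stack, size_t size)
{
    size_t n = (size < GUARD_BYTES) ? size : GUARD_BYTES;
    for (size_t i = 0; i < n; i++) {
        if (stack[i] != FILL_BYTE) {
            return false;
        }
    }
    return true;
}
```

The check only sees writes that land in the guard region, and only at the next context switch. A stray pointer that writes somewhere else entirely, or a large local array that skips over the guard bytes, goes unnoticed, so a clean result narrows the search without ruling out corruption.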

pete-pjb · June 2, 2021, 12:35pm

Hi All,

I think I might be a victim of the always unreliable Xilinx development environment, as it turns out.

I believe the Vitis IDE had mangled the board support package code somehow during an update of the FPGA fabric of the two projects concerned. After exporting the hardware from Vivado and updating the hardware definition in Vitis, I think it had corrupted the code. I have deleted the platform projects and recreated them, and all appears to have returned to normal…

I am very sorry to waste everyone’s time. @htibosch, I hope you haven’t spent time doing the special build! Please do not bother now!

Thank you very much for everyone’s input, it is very much appreciated.

Kind Regards,

Pete

cc: @RAc, @carlk3