Hi everyone,
I have a problem when running several jobs on one machine (one job per GPU).
With only one job, training goes well when I use the pin_memory=True option of the DataLoader: batches are delivered on time and GPU utilisation is high. I use 8 workers.
With pin_memory=False, the main process occasionally waits for data on some batches, but it is not too bad.
However, if I start a second job, the dataloading "freezes" on every 8th batch (I have 8 workers) for as long as a minute, slowing the training down dramatically. I timed the __getitem__
and collate
functions, and their timings stay short and constant, so the problem is neither disk throughput/I/O nor processing.
During the "freezes", I found that all the dataloader worker processes are dominated by kernel threads (all CPUs are shown in red in htop).
Setting pin_memory=False does not solve the issue.
So if the time is not spent in the __getitem__
or collate
functions, where does it go?
What happens between the moment a worker has produced a batch and the moment the main process can consume it?
Could you recommend some monitoring tools to find out where the problem comes from?
I am working on a GCP Linux virtual machine and my data is stored on an SSD drive.
Thanks for your help.