Hi everyone,
I have a problem when running several jobs on one machine (one job per GPU).
With only one job, training goes well when I use the pin_memory=True option of the DataLoader: batches are delivered on time and GPU utilisation is high. I use 8 workers.
With pin_memory=False, the main process occasionally waits for data on some batches, but it is not too bad.
However, if I start a second job, the dataloading "freezes" on every 8th batch (I have 8 workers) for as long as a minute, slowing the training down dramatically. I timed the __getitem__
and collate
functions, and their timings stay short and constant, so the problem is neither disk throughput/I/O nor processing.
During the "freezes", I found that all the dataloader worker processes are dominated by kernel threads (all CPUs are shown in red in htop).
Setting pin_memory=False does not solve the issue.
So if the time is not spent in the __getitem__
or collate
functions, where does it go?
What happens between the moment a worker has produced a batch and the moment the main process can consume it?
Could you recommend some monitoring tools to find out where the problem comes from?
I am working on a GCP Linux virtual machine and my data is stored on an SSD drive.
Thanks for your help.