I am training a ResNet-18 network in PyTorch with a standard PyTorch DataLoader. When I increase num_workers from 0 (no parallelization) to any other value (even just 1), the time to load a batch increases by a large factor (~5x) instead of decreasing. My CPU utilization is low, and the images are loaded from my local SSD.
I have 12 cores.
With num_workers=0, 5-6 cores are at 100% utilization and the rest are at 10-15%.
With num_workers=1, all cores are at 50-60% utilization.
With num_workers=2, all cores are at 60-70% utilization.
I am measuring the time by profiling with NVIDIA Nsight Systems.
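For anyone who wants to reproduce the comparison without a profiler, here is a minimal timing sketch. It rests on several assumptions: `FakeImages` and `mean_batch_time` are hypothetical names, the dataset returns random tensors instead of images read from the SSD, and the first batch is excluded so that worker startup is not counted. It therefore measures mostly DataLoader and worker overhead, not disk I/O or image decoding.

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class FakeImages(Dataset):
    """Hypothetical stand-in for the real image dataset (random tensors)."""

    def __init__(self, n=256):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Simulate a decoded 3x224x224 image plus an integer label.
        return torch.rand(3, 224, 224), idx % 10


def mean_batch_time(num_workers, batch_size=32, n=256):
    """Return average seconds per batch, excluding the first (warm-up) batch."""
    loader = DataLoader(FakeImages(n), batch_size=batch_size,
                        num_workers=num_workers)
    it = iter(loader)   # worker processes (if any) are started here
    next(it)            # warm-up batch, not timed
    start = time.perf_counter()
    count = 0
    for _ in it:
        count += 1
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    for w in (0, 1, 2):
        print(f"num_workers={w}: {mean_batch_time(w) * 1e3:.1f} ms/batch")
```

Swapping `FakeImages` for the real dataset should show whether the slowdown comes from the data pipeline itself or from the worker processes.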
In theory, num_workers=0 runs everything on the main thread, which should be faster than using a single worker. Do you see the same behaviour with a larger number of workers, like 10 or 20? (Note that you usually set this based on the number of threads rather than cores, and the thread count is usually twice the number of physical cores your CPU has.)
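To check how many threads the machine actually reports, something like the sketch below might help. `usable_cpus` is a hypothetical helper name; `os.cpu_count()` returns the number of logical CPUs (hardware threads), and `os.sched_getaffinity` is only available on some platforms such as Linux, so the code falls back when it is missing.

```python
import os


def usable_cpus():
    """Return the number of logical CPUs this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity limits (e.g. taskset or container settings).
        return len(os.sched_getaffinity(0))
    # os.cpu_count() can return None if the count cannot be determined.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("logical CPUs reported:", os.cpu_count())
    print("CPUs usable by this process:", usable_cpus())
```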
I am facing the same problem. I didn’t know that writing the for loop differently would affect the data loading speed so much, and I have spent hours trying to figure out what is wrong with a higher number of workers.