I am training a ResNet-18 network in PyTorch with a standard PyTorch DataLoader. When I increase the number of workers from 0 (no parallelization) to any number (even just to 1), the time to load a batch increases by a large factor (~5x) instead of decreasing. My CPU utilization is low, and the images are loaded from my local SSD.
How many cores do you use? Loading on the main thread is often faster. Additionally, how are you measuring that time? Multiprocessing should be slower at the beginning, until the workers reach full speed.
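To separate worker startup cost from steady-state throughput, something like this sketch could help. Note the assumptions: the `TensorDataset` here is a synthetic stand-in for the real image folder (the actual pipeline decodes JPEGs from an SSD, which is more expensive per sample), and `persistent_workers` requires PyTorch >= 1.7:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real dataset (assumption: the actual code
# decodes image files from an SSD, which costs more per sample).
dataset = TensorDataset(torch.randn(256, 3, 32, 32),
                        torch.randint(0, 10, (256,)))

# persistent_workers=True keeps worker processes alive between epochs,
# so the process startup cost is paid only on the first epoch.
loader = DataLoader(dataset, batch_size=32, num_workers=2,
                    persistent_workers=True)

for epoch in range(2):
    start = time.perf_counter()
    n_batches = sum(1 for _ in loader)  # iterate one full epoch
    elapsed = time.perf_counter() - start
    print(f"epoch {epoch}: {n_batches} batches in {elapsed:.3f}s")
```

The first epoch includes worker startup; if later epochs are still much slower than `num_workers=0`, startup overhead alone does not explain the gap.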
I have 12 cores.
With num_workers=0, the utilization is 5-6 cores@100% and the rest @10-15%
With num_workers=1, the utilization is @50-60% on all cores.
With num_workers=2, the utilization is @60-70% on all cores.
I am measuring time by profiling with NVIDIA Nsight Systems.
In theory num_workers=0 runs everything on the main thread, which should be faster than using a single worker. Did you find the same behaviour with a larger number of workers, like 10 or 20? (Note that num_workers is usually compared against the number of logical threads, which is typically twice the number of physical cores when hyperthreading is enabled.)
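One way to check this is to sweep num_workers up to the logical-core count reported by `os.cpu_count()`. A rough sketch (synthetic data stands in for the real image dataset, and the sweep is capped at 8 workers just to keep the example quick):

```python
import os
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in dataset (assumption: the real one reads images from SSD).
dataset = TensorDataset(torch.randn(512, 3, 32, 32),
                        torch.randint(0, 10, (512,)))

# os.cpu_count() reports *logical* cores, typically twice the physical
# core count when hyperthreading is enabled. Capped at 8 for this sketch.
max_workers = min(os.cpu_count() or 1, 8)

results = {}
for w in (0, 1, 2, max_workers):
    loader = DataLoader(dataset, batch_size=32, num_workers=w)
    start = time.perf_counter()
    for _ in loader:
        pass  # time one full epoch of batch loading
    results[w] = time.perf_counter() - start
    print(f"num_workers={w}: {results[w]:.3f}s")
```

If the slowdown appears at every worker count, the culprit is more likely per-worker startup or inter-process copying than a poor choice of worker count.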
I am facing the same problem. I didn't know that writing the for loop differently would affect the data loading speed so much, and I have been spending hours trying to figure out what is wrong with a higher number of workers.