num_workers > 0 makes image loading slower rather than faster

I am training a ResNet-18 network in PyTorch with a standard PyTorch DataLoader. When I increase the number of workers from 0 (no parallelization) to any number (even just 1), the time to load a batch increases by a large factor (~5x) instead of decreasing. My CPU utilization is low, and the images are loaded from my local SSD.

Any idea how to overcome this issue?

How many cores do you use? The main thread is usually faster. Additionally, how are you measuring that time? Multiprocessing should be slower at the beginning, until it reaches its maximum speed.
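One way to tell startup cost apart from steady-state speed is to record how long each batch takes to arrive and then ignore the first few. The following is only a minimal sketch: `batch_fetch_times`, `steady_state_mean`, and the `warmup` parameter are hypothetical names, and `loader` is assumed to be any iterable of batches (a DataLoader would be one such iterable). Note that the first entry also includes the time to create the iterator.

```python
import time


def batch_fetch_times(loader):
    """Return the wall-clock seconds spent waiting for each batch."""
    times = []
    last = time.perf_counter()
    for _batch in loader:
        now = time.perf_counter()
        times.append(now - last)
        # Restart the clock here so only the wait for the next batch is measured.
        last = time.perf_counter()
    return times


def steady_state_mean(times, warmup=2):
    """Average fetch time after skipping the first `warmup` batches."""
    steady = times[warmup:]
    return sum(steady) / len(steady) if steady else float("nan")
```

If the first one or two entries are large and the rest are small, the slowdown is startup overhead; if every entry is large, something is being redone on every batch.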

I have 12 cores.
With num_workers=0, the utilization is 5-6 cores@100% and the rest @10-15%
With num_workers=1, the utilization is @50-60% on all cores.
With num_workers=2, the utilization is @60-70% on all cores.

I am measuring time by profiling with NVIDIA nsight-systems

In theory, num_workers=0 runs everything on the main thread, which should be faster than using a single worker. Did you find the same behaviour using a large number of workers, like 10 or 20? (Note that what you usually set is the number of threads, not cores, and that is usually twice the number of cores your CPU has.)

I managed to solve the problem.
It is related to the fact that my code used:

           for i in range(len(train_loader)):
               X, y1, y2 = train_loader.__iter__().__next__()
               <process batch>

Replacing it with the following solved the problem:

           for i, (X, y1, y2) in enumerate(train_loader):
               <process batch>

Thanks for the assistance

I am facing the same problem. I didn’t know that writing the for loop differently would affect the data loading speed so much, and I have spent hours trying to figure out what goes wrong with a higher number of workers.

But why though? What causes the difference? They are basically the same IMO.

This Stack Overflow post answers your question.
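In short, as far as I understand it: calling `train_loader.__iter__()` inside the loop builds a brand-new iterator on every step. With num_workers > 0 and default settings, each new iterator has to start its worker processes again, and whatever those workers prefetched beyond the first batch is thrown away. Here is a toy sketch of that effect. `ToyLoader` and its `worker_startups` counter are hypothetical stand-ins that only count how often the startup happens; this is not PyTorch's actual implementation.

```python
class ToyLoader:
    """Hypothetical stand-in for a DataLoader with num_workers > 0."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size
        self.worker_startups = 0  # how many times "workers" were launched

    def __len__(self):
        # Number of batches, rounding up for a final partial batch.
        return (len(self.data) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        # Each new iterator pays the (simulated) worker startup cost.
        self.worker_startups += 1
        return self._batches()

    def _batches(self):
        for start in range(0, len(self.data), self.batch_size):
            yield self.data[start:start + self.batch_size]


loader = ToyLoader(list(range(10)), batch_size=4)

# Pattern from the question: a fresh iterator on every step.
slow_batches = [loader.__iter__().__next__() for _ in range(len(loader))]
slow_startups = loader.worker_startups

# Pattern from the fix: one iterator for the whole epoch.
loader.worker_startups = 0
fast_batches = [batch for _, batch in enumerate(loader)]
fast_startups = loader.worker_startups
```

In this toy, the first pattern starts up once per batch and the second only once per epoch. It also shows a second problem with the first pattern: without shuffling, it returns the first batch over and over, so the model never sees the rest of the data.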

Thanks a lot for the info :smiley: