Num_workers > 0 makes image loading slower rather than faster

yuval · October 18, 2019, 6:03pm

I am training a ResNet-18 network in PyTorch with a standard PyTorch DataLoader. When I increase the number of workers from 0 (no parallelization) to any number (even just 1), the time for loading a batch increases by a large factor (about 5x) instead of decreasing. My CPU utilization is low, and the images are loaded from my local SSD.

Any idea how to overcome this issue?

JuanFMontesinos · October 18, 2019, 6:54pm

How many cores do you use? The main thread is usually faster. Additionally, how are you measuring that time? Multiprocessing should be slower at the beginning until it reaches its maximum speed.
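To make the warm-up point concrete, here is a rough sketch of timing how long each batch takes to arrive while ignoring the first few. The helper `time_batches` and the toy generator `slow_first_batches` are hypothetical names made up for this illustration; they are not part of PyTorch, and this is not the profiler setup used in the thread. Any iterable can be passed in, including a DataLoader.

```python
import time


def time_batches(batches, warmup=2):
    """Return per-batch wait times in seconds, skipping the first `warmup` batches.

    Hypothetical helper for illustration only.
    """
    durations = []
    it = iter(batches)  # create the iterator once, outside the timed region
    index = 0
    while True:
        start = time.perf_counter()
        try:
            next(it)  # time only the wait for the next batch
        except StopIteration:
            break
        elapsed = time.perf_counter() - start
        if index >= warmup:
            durations.append(elapsed)
        index += 1
    return durations


def slow_first_batches():
    # Toy stand-in for a loader whose first two batches are slow to arrive.
    for i in range(5):
        time.sleep(0.05 if i < 2 else 0.001)
        yield i


steady = time_batches(slow_first_batches(), warmup=2)
print(len(steady), "steady-state batches timed")
```

Comparing the steady-state numbers for different num_workers values gives a fairer picture than timing only the first batch or two.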

yuval · October 18, 2019, 11:43pm

I have 12 cores.
With num_workers=0, the utilization is 5-6 cores@100% and the rest @10-15%
With num_workers=1, the utilization is @50-60% on all cores.
With num_workers=2, the utilization is @60-70% on all cores.

I am measuring time by profiling with NVIDIA nsight-systems

JuanFMontesinos · October 19, 2019, 5:29pm

In theory, num_workers=0 runs everything on the main thread, which should be faster than using a single worker. Do you see the same behaviour with a large number of workers, like 10 or 20? (Note that you usually set the number of threads, not cores, and the thread count is usually twice the number of cores your CPU has.)

yuval · October 19, 2019, 5:58pm

I managed to solve the problem.
It is related to the fact that my code used:

    for i in range(len(train_loader)):
        # __iter__() builds a brand-new iterator on every pass through the loop
        X, y1, y2 = train_loader.__iter__().__next__()
        ...  # process batch

Replacing it with the following solved the problem:

    for i, (X, y1, y2) in enumerate(train_loader):
        ...  # process batch

Thanks for the assistance

kinwai_cheuk · May 29, 2020, 4:16pm

I am facing the same problem. I didn’t know that writing the for loop differently would affect the data loading speed so much, and I have spent hours trying to figure out what is wrong with a higher number of workers.

MrCrHaM · July 11, 2023, 11:03am

But why though? What causes the difference? They are basically the same IMO.

ahmdtaha · July 16, 2023, 12:00am

This Stack Overflow post answers your question.
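Separately from the linked post, here is a rough way to see why the two loops behave differently. `FakeLoader` is a hypothetical stand-in written for this illustration, not PyTorch code: it only counts how often a new iterator is created. As far as I understand, a real DataLoader with num_workers > 0 starts its worker processes each time a new iterator is created, so the first loop pays that startup cost on every batch, while with num_workers=0 creating an iterator is cheap and the problem stays hidden.

```python
class FakeLoader:
    """Hypothetical stand-in for a data loader (not PyTorch code).

    Every call to __iter__ builds a fresh iterator; here we only count
    those calls instead of starting worker processes.
    """

    def __init__(self, batches):
        self.batches = list(batches)
        self.iter_calls = 0  # how many times a new iterator was created

    def __len__(self):
        return len(self.batches)

    def __iter__(self):
        self.iter_calls += 1  # assumed: a real loader would start its workers here
        return iter(self.batches)


def loop_with_fresh_iterator(loader):
    # Pattern from the original code: a new iterator on every step.
    seen = []
    for _ in range(len(loader)):
        seen.append(loader.__iter__().__next__())
    return seen


def loop_with_enumerate(loader):
    # Pattern from the fix: the for statement calls __iter__ exactly once.
    seen = []
    for _, batch in enumerate(loader):
        seen.append(batch)
    return seen


def loop_with_single_iterator(loader):
    # Manual variant: create the iterator once, then keep calling next() on it.
    it = iter(loader)
    return [next(it) for _ in range(len(loader))]


slow = FakeLoader(["b0", "b1", "b2", "b3"])
fast = FakeLoader(["b0", "b1", "b2", "b3"])
manual = FakeLoader(["b0", "b1", "b2", "b3"])

slow_seen = loop_with_fresh_iterator(slow)
fast_seen = loop_with_enumerate(fast)
manual_seen = loop_with_single_iterator(manual)

print(slow.iter_calls, slow_seen)      # 4 ['b0', 'b0', 'b0', 'b0']
print(fast.iter_calls, fast_seen)      # 1 ['b0', 'b1', 'b2', 'b3']
print(manual.iter_calls, manual_seen)  # 1 ['b0', 'b1', 'b2', 'b3']
```

Note the second effect: the first pattern only ever returns the first batch of a fresh iterator, so without shuffling it would process the same batch over and over. If you need to call next() by hand, creating the iterator once with iter(train_loader) before the loop avoids both issues.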

MrCrHaM · July 20, 2023, 2:55am

Thanks a lot for the info