Dataloader: more num_workers do not reduce runtime?

Hi everyone! I am using a DataLoader for NN training on 1 GPU. When I increase num_workers to 2, 4, or 8, the runtime does not decrease. Can someone explain why that is and how to improve runtime?

(Other parameter settings: shuffle=True, pin_memory=True)
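For reference, a minimal sketch of the setup described above (the dataset, batch size, and tensor shapes are placeholders, not my real data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the real dataset
dataset = TensorDataset(
    torch.randn(1_000, 3, 64, 64),        # fake images
    torch.randint(0, 10, (1_000,)),       # fake labels
)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
    num_workers=4,  # varied between 2, 4 and 8 with no speedup
)
```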

Thank you!


The best approach depends on the data and what you are doing with it. Flat runtime with more workers could mean that your CPU is saturated, that the bottleneck isn’t the CPU at all (but e.g. storage), or something else entirely.

Some general ideas:

  • do on-the-fly processing (e.g. augmentation) on the GPU as much as you can, i.e. not in the dataloader,
  • storage can be a huge bottleneck: don’t load huge images only to scale them down; instead, do a preprocessing step in advance where you scale down to a “reasonable size” (it doesn’t need to be the final size, but I’ve seen things being slow because XX-megapixel images were loaded only to then be immediately rescaled + cropped to 227x227),
  • if you adjust your pipeline, the Thomas rule of thumb is: it isn’t optimization unless you measure before and after. Things like pin_memory are often advocated, but people sometimes find they help and sometimes not.
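The “measure before and after” advice can be sketched as a small benchmark loop: time one pass over the DataLoader for each num_workers / pin_memory combination you care about. The dataset below is a dummy stand-in; substitute your real dataset and (ideally) your real training step, since a bare iteration loop only measures loading, not the full pipeline.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real one (sizes are illustrative)
dataset = TensorDataset(
    torch.randn(2_000, 3, 64, 64),
    torch.randint(0, 10, (2_000,)),
)

def time_loader(num_workers: int, pin_memory: bool) -> float:
    """Time one full pass over the DataLoader with the given settings."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=num_workers, pin_memory=pin_memory)
    start = time.perf_counter()
    for images, labels in loader:
        pass  # replace with the actual training step when profiling for real
    return time.perf_counter() - start

for workers in (0, 2, 4, 8):
    print(f"num_workers={workers}: {time_loader(workers, pin_memory=True):.3f}s")
```

If the timings barely change with more workers, loading isn’t the bottleneck for this data, and the effort is better spent elsewhere (e.g. storage or GPU-side augmentation).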

Best regards


Thank you for the quick response, Tom! I’ve just figured out the problem; see my last reply here: Dataloader num_workers relate to gpu memory? - Memory Format - PyTorch Forums