Selecting num_workers is pretty tricky. As I slowly migrated to PyTorch Lightning, I noticed it gives you a warning suggesting a suitable num_workers based on your hardware and data. But in plain PyTorch, as of now, I think it's trial and error.
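For anyone new to this, num_workers is just an argument to DataLoader; here is a minimal sketch of the knob being discussed (the toy dataset and batch size are only placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset standing in for whatever dataset you actually use.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# num_workers=0 loads batches in the main process;
# num_workers>0 spawns that many worker processes for loading.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
```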
For me, increasing num_workers reduces the data loading time per batch, but it also occasionally slows things down so much that, e.g. over 100 batches, it is slower than with num_workers=0. I haven't figured out what causes these hiccups.
I’ve also been experiencing the same issue for a while. I’m not even sure I ever truly benefited from multiple workers, since I noticed this problem rather late. I have 8 cores and have tried running with 0, 1, 2, …, 8 workers. The main process (0 workers) consistently gave me the fastest loading. This is also the case for data that is already pre-loaded into memory.
Same issue here on PyTorch 1.12.0. PyTorch Lightning throws a PossibleUserWarning and suggests using 8 workers (the number of cores on my M1 CPU), but doing so results in a huge slowdown.
Same behaviour on Windows 10 Pro, PyTorch Lightning 2.0.4, Torch 2.0.0+cu117.
I have 20 cores, and setting num_workers to 20 causes a slowdown of several minutes between each epoch. Setting num_workers to 1 or 2 already gives a much better result, with a slowdown of about 20 seconds. With num_workers at 0 I get by far the best results, with a slowdown of maybe 2-3 seconds at most.
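For the slowdown between epochs specifically, one thing that might be worth checking (just a guess, not a confirmed fix for your setup) is that by default the worker processes are shut down and respawned at the start of every epoch; the persistent_workers argument of DataLoader keeps them alive across epochs. A minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset.
dataset = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))

# persistent_workers=True (PyTorch >= 1.7) keeps the worker processes alive
# between epochs instead of respawning them at the start of each one.
loader = DataLoader(dataset, batch_size=64, num_workers=2, persistent_workers=True)
```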
Same on Ubuntu 20.04.5 LTS, using PyTorch 2.1.0 and Lightning 2.1.0. Do we have any updates on this? Is there any guideline as to when we should set num_workers > 0?
For people arriving here looking for an answer: the general recommendation of num_workers = number of CPU threads is not valid in many use cases.
In the use case mentioned in this post, since the data is already in memory, I would guess the overhead of spinning up multiple processes makes parallel loading not worthwhile.
Another use case that makes num_workers tricky is 3D data, where heavy concurrent file reading can actually make loading slower than using, for example, a single worker.
In most cases, even if 2 workers is slower, 1 should be better than 0, since with 0 workers data loading competes with the code in your main training loop. However, some image processing functions from data augmentation libraries run faster when called in the main process than in worker subprocesses. In subprocesses they are limited to 100% of one thread, whereas I have only seen the same code use full multithreading in its underlying C implementations when called with num_workers=0. I don't know why that happens.
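If you want to poke at the threading side yourself, here is a rough sketch (it only covers PyTorch's own intra-op thread setting, not whatever internal thread pools an augmentation library manages) that prints the thread count inside each worker via worker_init_fn:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def report_threads(worker_id):
    # DataLoader workers start with PyTorch's intra-op thread count set to 1,
    # so ops that are multithreaded in the main process run single-threaded here.
    print(f"worker {worker_id}: torch intra-op threads = {torch.get_num_threads()}")

if __name__ == "__main__":  # guard needed for multi-worker loading on Windows/macOS
    dataset = TensorDataset(torch.randn(256, 10))
    loader = DataLoader(dataset, batch_size=32, num_workers=2,
                        worker_init_fn=report_threads)
    for _batch in loader:
        pass
    print(f"main process: torch intra-op threads = {torch.get_num_threads()}")
```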
So, this is a mess and really should be determined through experimentation for each use case. It's way more complicated than it looks!
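If it helps anyone, here is a rough sketch of that experiment (the toy in-memory dataset, batch size, and worker counts are placeholders you would swap for your own data and transforms):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_loader(dataset, num_workers, batch_size=64, epochs=2):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers)
    start = time.perf_counter()
    for _ in range(epochs):
        for _batch in loader:
            pass  # just iterate; we only care about the data loading cost
    # Timing includes worker startup each epoch, which is part of what you want to measure.
    return time.perf_counter() - start

if __name__ == "__main__":  # guard needed for multi-worker loading on Windows/macOS
    dataset = TensorDataset(torch.randn(2000, 3, 64, 64),
                            torch.randint(0, 10, (2000,)))
    for workers in (0, 1, 2, 4, 8):
        print(f"num_workers={workers}: {time_loader(dataset, workers):.2f} s")
```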