DataLoader with NumPy much slower when num_workers > 0

Hello everyone,

I have been working on a project where the data and features are stored in NumPy arrays, and I found that the DataLoader was quite slow when num_workers > 0, so I decided to reproduce the issue with a dummy example:

import numpy as np
from torch.utils.data import DataLoader


class NumpyDataset:
    def __init__(self, size: int):
        self.data = np.random.rand(size, 2)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return {
            "feature": self.data[i][0],
            "target": self.data[i][1]
        }

SIZE = 1_000_000    # varied over 1e5, 1e6, 1e7 (see table below)
BATCH_SIZE = 64     # varied over 32, 64, 128
NUM_WORKERS = 8     # compared against 0

ds = NumpyDataset(size=SIZE)
dl = DataLoader(dataset=ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)
for _ in dl:
    pass
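
To collect the timings for each combination of parameters, I used a loop roughly like this (a sketch, not the exact script; it just measures one full pass over the DataLoader):

import time
from itertools import product

for size, batch_size, num_workers in product(
        [100_000, 1_000_000, 10_000_000], [32, 64, 128], [0, 8]):
    ds = NumpyDataset(size=size)
    dl = DataLoader(dataset=ds, batch_size=batch_size, num_workers=num_workers)

    start = time.perf_counter()
    for _ in dl:
        pass  # iterate once over the full dataset, discarding the batches
    total_time = time.perf_counter() - start

    print(size, batch_size, num_workers, total_time)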

After running some benchmarks for different combinations of parameters, these were the results:

size     batch_size   num_workers   total_time (s)
1e+05    32           0             0.041524
1e+05    32           8             0.387215
1e+05    64           0             0.0182004
1e+05    64           8             0.260331
1e+05    128          0             0.0164798
1e+05    128          8             0.184617
1e+06    32           0             2.30033
1e+06    32           8             28.3145
1e+06    64           0             1.73181
1e+06    64           8             14.0961
1e+06    128          0             1.51957
1e+06    128          8             8.15612
1e+07    32           0             22.3278
1e+07    32           8             281.27
1e+07    64           0             15.7327
1e+07    64           8             151.014
1e+07    128          0             14.3264
1e+07    128          8             75.8562

From these results I could see that:

  • num_workers = 0 is around an order of magnitude faster than num_workers = 8
  • The gap appears to shrink as the batch size grows

Does anyone know why this might be happening?
Is it recommended to run single-process data loading (num_workers = 0) when dealing with NumPy?

Thank you!

I don't think this effect necessarily depends on the usage of numpy; it is more likely the expected overhead of using multiple processes just to index an already preloaded dataset.
Multiple workers are especially beneficial if you are lazily loading and processing the data, i.e. if each __getitem__ call loads and transforms a single sample. In that case each worker creates a full batch in the background while the main process is busy with the actual model training.
In your example the dataset is already preloaded in __init__, so each worker only indexes its own copy of the data and assembles the batch, and the assembled batches still have to be sent back to the main process. That overhead is not offset by any saved work, which can yield an overall slowdown.
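
As a rough sketch of what I mean (assuming, purely for illustration, that each sample lived in its own .npy file on disk; this is not your setup), a lazily loading dataset would look something like this:

import numpy as np
from torch.utils.data import Dataset, DataLoader


class LazyNumpyDataset(Dataset):
    """Hypothetical dataset that loads one sample per __getitem__ call."""

    def __init__(self, sample_paths):
        # Only the list of file paths is kept in memory; the arrays
        # themselves are read from disk on demand.
        self.sample_paths = sample_paths

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, i):
        sample = np.load(self.sample_paths[i])  # the expensive I/O happens here
        return {
            "feature": sample[0],
            "target": sample[1],
        }


# dl = DataLoader(LazyNumpyDataset(paths), batch_size=64, num_workers=8)

With per-sample loading like this, the workers can overlap the I/O and preprocessing with the training loop, which is where num_workers > 0 usually pays off.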
