Dataloader with Numpy much slower when num_workers > 0

Hello everyone,

I have been working on a project where the data and features are stored in NumPy arrays, and I found that the DataLoader was quite slow when num_workers > 0, so I decided to reproduce the issue with a dummy example:

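import numpy as np
from torch.utils.data import DataLoader

class NumpyDataset:
    def __init__(self, size: int):
        # "data" is a placeholder attribute name
        self.data = np.random.rand(size, 2)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        # "x" is a placeholder key for the returned sample
        return {"x": self.data[i]}


# SIZE, BATCH_SIZE and NUM_WORKERS are the benchmark parameters listed in the results
ds = NumpyDataset(size=SIZE)
dl = DataLoader(ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)
for _ in dl:
    pass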

After running some benchmarks for different combinations of parameters, these were the results:

size    batch_size  num_workers  total_time (s)
1e+05   32          0            0.041524
1e+05   32          8            0.387215
1e+05   64          0            0.0182004
1e+05   64          8            0.260331
1e+05   128         0            0.0164798
1e+05   128         8            0.184617
1e+06   32          0            2.30033
1e+06   32          8            28.3145
1e+06   64          0            1.73181
1e+06   64          8            14.0961
1e+06   128         0            1.51957
1e+06   128         8            8.15612
1e+07   32          0            22.3278
1e+07   32          8            281.27
1e+07   64          0            15.7327
1e+07   64          8            151.014
1e+07   128         0            14.3264
1e+07   128         8            75.8562

From these results I could see that:

  • num_workers = 0 is around 1 order of magnitude faster than num_workers = 8
  • The difference appears to reduce with bigger batch sizes

Does anyone know why this might be happening?
Is it recommended to run single-process loading (num_workers = 0) when dealing with NumPy?

Thank you!

I don’t think this effect necessarily depends on the usage of NumPy; it might just be the expected overhead of using multiple processes to merely index an already preloaded dataset.
Multiple workers are beneficial especially if you are lazily loading and processing the data, i.e. if a single sample is loaded and transformed in each __getitem__ call. In this case each worker will create a full batch in the background while the main thread is busy with the actual model training.
In your example you have already preloaded the dataset in __init__, so each worker only indexes the data sample from its copy and creates the batch, which could yield an overall slowdown due to the added overhead.
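
To make the distinction above concrete, here is a minimal sketch rather than a benchmark of the original setup. PreloadedDataset mirrors the example in the question; LazyDataset, its delay_s parameter, and the time.sleep call are hypothetical stand-ins for real per-sample loading and transformation cost, and the sizes are arbitrary.

```python
import time

import numpy as np
from torch.utils.data import DataLoader, Dataset


class PreloadedDataset(Dataset):
    """All data is created in __init__; __getitem__ only indexes an array."""

    def __init__(self, size: int):
        self.data = np.random.rand(size, 2)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]


class LazyDataset(Dataset):
    """Each sample is produced on demand; sleep() simulates I/O or transforms."""

    def __init__(self, size: int, delay_s: float = 0.001):
        self.size = size
        self.delay_s = delay_s  # hypothetical per-sample cost

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        time.sleep(self.delay_s)
        return np.full(2, i, dtype=np.float64)


def time_epoch(dataset, batch_size: int, num_workers: int):
    """Iterate once over the dataset; return (seconds, number of batches)."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    n_batches = sum(1 for _ in loader)
    return time.perf_counter() - start, n_batches


if __name__ == "__main__":
    for ds in (PreloadedDataset(512), LazyDataset(512)):
        for workers in (0, 2):
            seconds, _ = time_epoch(ds, batch_size=32, num_workers=workers)
            print(f"{type(ds).__name__:17s} workers={workers}: {seconds:.3f}s")
```

With the preloaded dataset one would typically expect the worker startup and inter-process transfer to dominate, while the lazy dataset is the case where extra workers have a chance to pay off. The actual numbers depend on the machine and the start method, so treat the output as indicative only.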

Hey @ptrblck,
I have a similar issue despite loading and processing the data in the __getitem__ call instead of __init__. Is there any other possible reason for this?

Assuming you are lazily loading the samples in __getitem__, I would guess you might see poor performance if the actual loading is the bottleneck and thus blocks the other parts of the code (which could be the case if you are using e.g. a slow network drive).
Could you profile the data loader alone and see how the speed changes for different numbers of workers?
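
One possible way to profile the loader on its own is sketched below. It separates the time to the first batch (which includes starting the worker processes) from the time for the remaining batches. The profile_loader helper and the TensorDataset stand-in are assumptions for illustration; the real dataset would be substituted in.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def profile_loader(dataset, batch_size: int, num_workers: int):
    """Return (first_batch_seconds, remaining_seconds, n_batches).

    Assumes the dataset yields at least one batch.
    """
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    it = iter(loader)  # worker processes are started here when num_workers > 0
    next(it)           # wait for the first batch
    first = time.perf_counter() - start
    n_batches = 1
    for _ in it:       # no model code: only the loader is measured
        n_batches += 1
    total = time.perf_counter() - start
    return first, total - first, n_batches


if __name__ == "__main__":
    # Stand-in data; replace with the dataset under investigation.
    dataset = TensorDataset(torch.rand(10_000, 2))
    for workers in (0, 1, 2, 4):
        first, rest, n = profile_loader(dataset, batch_size=64, num_workers=workers)
        print(f"workers={workers}: first batch {first:.3f}s, "
              f"remaining {n - 1} batches {rest:.3f}s")
```

If the first-batch time grows with the number of workers while the remaining time stays flat or shrinks, startup cost is the likely culprit; if the remaining time itself does not improve, the per-sample loading path is worth a closer look.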

I used the torch profiler and here is what I got:

DataLoader time (microseconds) for various num_workers values
0: 1,484,033
1: 13,361,716
2: 14,867,360
3: 14,596,160
4: 15,243,254
5: 14,412,214
20: 15,291,521

I want to understand the multiprocessing happening here better. Are the operations in __init__ done in the main process, with only the operations inside __getitem__ running in each of the workers? If so, in XavierB’s case the overhead of copying the data loaded in __init__ to each worker would be the bottleneck. Is it correct that __init__ should contain as few operations as possible?
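
One way to check this empirically is the small sketch below. It records the process ID when the dataset is constructed and again inside __getitem__, together with the worker ID reported by torch.utils.data.get_worker_info(). The WhereAmI class is a hypothetical name used only for this illustration.

```python
import os

from torch.utils.data import DataLoader, Dataset, get_worker_info


class WhereAmI(Dataset):
    """Reports which process built the dataset and which one serves samples."""

    def __init__(self, size: int):
        self.size = size
        self.init_pid = os.getpid()  # recorded once, where the object is built

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        info = get_worker_info()  # None when called in the main process
        worker_id = -1 if info is None else info.id
        return worker_id, os.getpid(), self.init_pid


if __name__ == "__main__":
    ds = WhereAmI(8)
    for workers in (0, 2):
        loader = DataLoader(ds, batch_size=4, num_workers=workers)
        for worker_ids, pids, init_pids in loader:
            print(f"num_workers={workers}",
                  "worker ids:", worker_ids.tolist(),
                  "getitem pids:", pids.tolist(),
                  "init pids:", init_pids.tolist())
```

With num_workers=0 the worker ID is reported as -1 and both PIDs match. With workers enabled, the __getitem__ PIDs should differ from the PID recorded in __init__, which would suggest that the dataset object is built once in the main process and then handed to each worker.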

Also, after the workers load the data, do they put it in some sort of “queue” (though maybe queue is not the right term here) to be fed to the GPU during training? If so, I suspect there is an optimal num_workers, because if we keep increasing that number we just create an unnecessarily long queue (which perhaps also consumes memory)?
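
On the question of how much gets queued: DataLoader has a prefetch_factor argument, documented as the number of batches loaded in advance by each worker, so roughly num_workers * prefetch_factor batches can be waiting at once. The sketch below only illustrates that relationship; make_loader and max_batches_in_flight are hypothetical helper names, and the TensorDataset is stand-in data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size: int, num_workers: int, prefetch_factor: int = 2):
    """Build a DataLoader; prefetch_factor is only passed when num_workers > 0."""
    kwargs = {}
    if num_workers > 0:
        # Each worker prepares this many batches ahead of consumption.
        kwargs["prefetch_factor"] = prefetch_factor
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers, **kwargs)


def max_batches_in_flight(num_workers: int, prefetch_factor: int = 2) -> int:
    """Rough upper bound on prefetched batches across all workers."""
    return num_workers * prefetch_factor


if __name__ == "__main__":
    dataset = TensorDataset(torch.rand(1_000, 2))  # stand-in data
    loader = make_loader(dataset, batch_size=50, num_workers=2, prefetch_factor=4)
    n_batches = sum(1 for _ in loader)
    print(f"{n_batches} batches, up to {max_batches_in_flight(2, 4)} prefetched at a time")
```

Under this reading, adding workers beyond what the training loop can consume mostly increases the number of batches held in memory, so sweeping num_workers on the actual workload is probably the most reliable way to find a good value.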