Hello everyone,
I have been working on a project where the data and features are stored in NumPy arrays, and I found that the `DataLoader` was quite slow when `num_workers > 0`, so I decided to reproduce the issue with a dummy example:
```python
import numpy as np
from torch.utils.data import DataLoader


class NumpyDataset:
    def __init__(self, size: int):
        self.data = np.random.rand(size, 2)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return {
            "feature": self.data[i][0],
            "target": self.data[i][1],
        }


SIZE = ...         # TBD
BATCH_SIZE = ...   # TBD
NUM_WORKERS = ...  # TBD

ds = NumpyDataset(size=SIZE)
dl = DataLoader(dataset=ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)
for _ in dl:
    pass
```
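For reproducibility, here is a self-contained version of the benchmark I ran. The `time_dataloader` helper name and the exact timing harness are a sketch (I simply wrapped the consumption loop with `time.perf_counter`); the parameter values below are illustrative, not the ones from the table:

```python
import time

import numpy as np
from torch.utils.data import DataLoader


class NumpyDataset:
    def __init__(self, size: int):
        self.data = np.random.rand(size, 2)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return {
            "feature": self.data[i][0],
            "target": self.data[i][1],
        }


def time_dataloader(size: int, batch_size: int, num_workers: int) -> float:
    """Return wall-clock seconds for one full pass over the dataset."""
    ds = NumpyDataset(size=size)
    dl = DataLoader(dataset=ds, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    for _ in dl:  # just consume the batches, no computation
        pass
    return time.perf_counter() - start


if __name__ == "__main__":
    # Small size so this finishes quickly; scale SIZE up to reproduce the table.
    for nw in (0, 2):
        print(f"num_workers={nw}: {time_dataloader(10_000, 32, nw):.4f}s")
```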
After running some benchmarks for different combinations of parameters, these were the results:
| size | batch_size | num_workers | total_time (s) |
|---|---|---|---|
| 1e+05 | 32 | 0 | 0.041524 |
| 1e+05 | 32 | 8 | 0.387215 |
| 1e+05 | 64 | 0 | 0.0182004 |
| 1e+05 | 64 | 8 | 0.260331 |
| 1e+05 | 128 | 0 | 0.0164798 |
| 1e+05 | 128 | 8 | 0.184617 |
| 1e+06 | 32 | 0 | 2.30033 |
| 1e+06 | 32 | 8 | 28.3145 |
| 1e+06 | 64 | 0 | 1.73181 |
| 1e+06 | 64 | 8 | 14.0961 |
| 1e+06 | 128 | 0 | 1.51957 |
| 1e+06 | 128 | 8 | 8.15612 |
| 1e+07 | 32 | 0 | 22.3278 |
| 1e+07 | 32 | 8 | 281.27 |
| 1e+07 | 64 | 0 | 15.7327 |
| 1e+07 | 64 | 8 | 151.014 |
| 1e+07 | 128 | 0 | 14.3264 |
| 1e+07 | 128 | 8 | 75.8562 |
From these results I could see that:

- `num_workers = 0` is around one order of magnitude faster than `num_workers = 8`
- The difference appears to shrink with bigger batch sizes
Does anyone know why this might be happening?
Is it recommended to run single-threaded operations when dealing with NumPy?
Thank you!