Increasing num_workers for DataLoader appears to be of no help: it is always slower than num_workers=0, across batch sizes 32, 64, 128, and 256, and pin_memory makes no difference either (a sketch of how that was toggled follows the code). The hardware is a MacBook Pro with an M3 Max and 36 GB of RAM. Does this make sense?
Here is the test code, using the built-in FashionMNIST dataset and DataLoader (the dataset was downloaded on a prior run). I saw similar results, actually much worse, on my own dataset, which is backed by a pandas DataFrame of image file paths.
from time import time
import multiprocessing as mp

from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms.v2 as v2

epochs_per_inner_loop = 3

if __name__ == '__main__':  # needed on macOS, where workers are spawned
    dataset = torchvision.datasets.FashionMNIST(
        root='data',
        train=True,
        download=True,
        transform=v2.ToImage(),
    )

    for batch_size in [32, 64, 128, 256]:
        print(f'batch_size = {batch_size}')
        for num_workers in range(0, mp.cpu_count() + 2, 2):
            dataloader = DataLoader(dataset, batch_size=batch_size,
                                    num_workers=num_workers)
            t_start = time()
            for _ in range(epochs_per_inner_loop):
                for _ in dataloader:  # just drain the loader; no model
                    pass
            t_per_epoch = (time() - t_start) / epochs_per_inner_loop
            print(f'  {t_per_epoch:.1f} sec, {num_workers} workers')
        print('--------------------------')
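For completeness, the pin_memory comparison mentioned above used the same harness; here is a minimal sketch of that variant, assuming pin_memory was simply passed through to the DataLoader (the batch size and worker count here are arbitrary). Since pin_memory exists to speed up host-to-GPU copies via page-locked memory on CUDA, seeing no effect on a CUDA-less Mac would be expected.

# Hypothetical variant of the sweep above: same timing loop, pin_memory toggled.
for pin_memory in [False, True]:
    dataloader = DataLoader(dataset, batch_size=64, num_workers=4,
                            pin_memory=pin_memory)
    t_start = time()
    for _ in range(epochs_per_inner_loop):
        for _ in dataloader:
            pass
    t_per_epoch = (time() - t_start) / epochs_per_inner_loop
    print(f'  {t_per_epoch:.1f} sec, pin_memory={pin_memory}')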
The results of the num_workers sweep are as follows:
batch_size = 32
  0.9 sec, 0 workers
  1.9 sec, 2 workers
  1.8 sec, 4 workers
  1.8 sec, 6 workers
  1.9 sec, 8 workers
  2.1 sec, 10 workers
  2.4 sec, 12 workers
  2.9 sec, 14 workers
batch_size = 64
  1.0 sec, 0 workers
  1.8 sec, 2 workers
  1.7 sec, 4 workers
  1.7 sec, 6 workers
  1.9 sec, 8 workers
  2.0 sec, 10 workers
  2.4 sec, 12 workers
  2.9 sec, 14 workers
batch_size = 128
  1.0 sec, 0 workers
  1.7 sec, 2 workers
  1.7 sec, 4 workers
  1.7 sec, 6 workers
  1.8 sec, 8 workers
  2.1 sec, 10 workers
  2.4 sec, 12 workers
  2.8 sec, 14 workers
batch_size = 256
  1.0 sec, 0 workers
  1.6 sec, 2 workers
  1.6 sec, 4 workers
  1.7 sec, 6 workers
  1.8 sec, 8 workers
  2.0 sec, 10 workers
  2.3 sec, 12 workers
  2.9 sec, 14 workers
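One plausible explanation worth testing: on macOS the DataLoader starts workers with the spawn method, and by default the worker processes are torn down and re-created every time iteration over the loader restarts, i.e. once per epoch. For a small in-memory dataset with a transform as cheap as ToImage, that fixed startup cost can easily outweigh any parallel speedup. A minimal sketch to isolate this, reusing dataset and epochs_per_inner_loop from the script above (the specific worker counts and prefetch_factor value are guesses):

# Keep the worker pool alive across epochs instead of re-spawning it,
# so only steady-state loading throughput is measured.
for num_workers in [0, 2, 4, 8]:
    extra = {}
    if num_workers > 0:
        extra = dict(persistent_workers=True, prefetch_factor=4)
    dataloader = DataLoader(dataset, batch_size=64,
                            num_workers=num_workers, **extra)
    t_start = time()
    for _ in range(epochs_per_inner_loop):
        for _ in dataloader:
            pass
    t_per_epoch = (time() - t_start) / epochs_per_inner_loop
    print(f'  {t_per_epoch:.1f} sec, {num_workers} workers, '
          f'persistent={num_workers > 0}')

If the gap to num_workers=0 shrinks here, worker startup is the dominant cost; if it does not, the per-sample work is probably just too cheap for multiprocessing to pay off, since each batch also has to be pickled and shipped back to the main process.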