DataLoader num_workers does not help

Increasing num_workers for DataLoader does not help: every setting above 0 is slower than num_workers=0 across batch sizes 32, 64, 128, and 256 (pin_memory makes no difference). Hardware is a MacBook Pro with an M3 Max and 36 GB of RAM. Does this make sense?

Here's the test code, using the built-in FashionMNIST dataset (already downloaded on a prior run) and DataLoader. I saw similar, and in fact much worse, results on my own dataset, which is backed by a pandas DataFrame of image file paths.

from time import time
import multiprocessing as mp
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms.v2 as v2

epochs_per_inner_loop = 3  # passes over the dataset per (batch_size, num_workers) setting

dataset = torchvision.datasets.FashionMNIST(
    root='data',
    train=True,
    download=True,
    transform=v2.ToImage()
)


for batch_size in [32,64,128,256]:
    print(f'batch_size = {batch_size}')

    for num_workers in range(0, mp.cpu_count()+2, 2):  # sweep 0, 2, ..., cpu_count()
        
        dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)

        t_start = time()
        for _ in range(epochs_per_inner_loop):
            for _ in dataloader:  # just drain the loader; no model work, only data loading
                pass
        
        t_end = time()
        t_elapsed = t_end - t_start
        t_per_epoch = t_elapsed / epochs_per_inner_loop

        print(f'  {t_per_epoch:.1f} sec, {num_workers} workers')
        
    print('--------------------------')
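
One practical note: when a script like this is run directly on macOS with num_workers > 0, the worker processes are started with the spawn method, so the sweep should live under an if __name__ == '__main__': guard. A minimal sketch, where main() is just a hypothetical wrapper around the loops above:

def main():
    ...  # the batch_size / num_workers sweep from above

if __name__ == '__main__':
    main()  # prevents spawned workers from re-running the sweep on import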

The results are as follows:

batch_size = 32
0.9 sec, 0 workers
1.9 sec, 2 workers
1.8 sec, 4 workers
1.8 sec, 6 workers
1.9 sec, 8 workers
2.1 sec, 10 workers
2.4 sec, 12 workers
2.9 sec, 14 workers

batch_size = 64
1.0 sec, 0 workers
1.8 sec, 2 workers
1.7 sec, 4 workers
1.7 sec, 6 workers
1.9 sec, 8 workers
2.0 sec, 10 workers
2.4 sec, 12 workers
2.9 sec, 14 workers

batch_size = 128
1.0 sec, 0 workers
1.7 sec, 2 workers
1.7 sec, 4 workers
1.7 sec, 6 workers
1.8 sec, 8 workers
2.1 sec, 10 workers
2.4 sec, 12 workers
2.8 sec, 14 workers

batch_size = 256
1.0 sec, 0 workers
1.6 sec, 2 workers
1.6 sec, 4 workers
1.7 sec, 6 workers
1.8 sec, 8 workers
2.0 sec, 10 workers
2.3 sec, 12 workers
2.9 sec, 14 workers

I think I understand the issue here (basically user error): torchvision's FashionMNIST loads all 60k samples into memory as a single tensor up front, so each sample fetch is just an index into that tensor followed by the trivial v2.ToImage() transform. There is essentially no per-sample work to parallelize, and the fixed overhead of spawning workers and shipping batches back to the main process outweighs any gain.
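
A quick way to confirm this is to time raw sample fetches directly, bypassing the DataLoader entirely (a rough sketch; the sample count and timing approach are just for illustration):

from time import perf_counter

# Each dataset[i] is just an index into the in-memory tensor plus v2.ToImage().
n = 1000
t0 = perf_counter()
for i in range(n):
    _ = dataset[i]
print(f'{(perf_counter() - t0) / n * 1e6:.1f} microseconds per sample')

If the per-sample cost comes out in the microsecond range, worker startup and IPC overhead will easily swamp any parallel speedup.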

Using my own small (~1k sample) dataset, with a non-trivial transform (a Compose of RandomRotation, RandomPerspective, ColorJitter, RandomEqualize, RandomResizedCrop, and Normalize) applied to images of varying sizes, I see the results I would expect (a rough sketch of that dataset follows the timings below):

22.5 sec, 0 workers
16.6 sec, 2 workers
9.8 sec, 4 workers
7.9 sec, 6 workers
6.5 sec, 8 workers
6.4 sec, 10 workers
6.2 sec, 12 workers
6.7 sec, 14 workers
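
For reference, the per-sample work in that dataset looks roughly like this (a simplified sketch, not the exact code; the class name, the 'path' column, and the transform parameters are placeholders). A disk read plus decode plus a heavy augmentation chain per sample is exactly the case where extra workers pay off:

import torch
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms.v2 as v2

class ImagePathDataset(Dataset):
    """Reads one image file from disk per sample and applies the full transform."""
    def __init__(self, df, transform):
        self.paths = df['path'].tolist()  # pandas DataFrame column of image file paths
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')  # disk read + decode
        return self.transform(img)

transform = v2.Compose([
    v2.ToImage(),                                   # PIL image -> uint8 tensor image
    v2.RandomRotation(15),
    v2.RandomPerspective(),
    v2.ColorJitter(brightness=0.2, contrast=0.2),
    v2.RandomEqualize(),
    v2.RandomResizedCrop(224),
    v2.ToDtype(torch.float32, scale=True),          # convert to float before Normalize
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])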