Data loader multiprocessing slow on macOS

I’ve noticed a significant slowdown when using num_workers > 0 on macOS with Python 3.8+. Here is the benchmarking script:

import sys
import time

from torch.utils.data import DataLoader, Dataset


class MyDataset(Dataset):
    def __getitem__(self, i):
        return i

    def __len__(self):
        return 5


if __name__ == '__main__':
    ds = MyDataset()
    dl = DataLoader(ds, num_workers=int(sys.argv[1]))

    start = time.time()
    for i in dl:
        continue
    end = time.time()
    print(end - start)

I see these times on macOS:

$ python test.py 0
0.0023140907287597656
$ python test.py 1
9.695190906524658
$ python test.py 2
14.391160249710083
$ python test.py 3
18.87164807319641

I would expect these to take approximately the same time. Obviously, there is some overhead in spawning and tearing down worker processes, but it is far greater than on Linux, where the same script gives:

$ python test.py 0
0.0008292198181152344
$ python test.py 1
0.025320768356323242
$ python test.py 2
0.03378176689147949
$ python test.py 3
0.030670166015625

The issue seems to be that starting with Python 3.8, the default multiprocessing start method on macOS changed from fork to spawn. The key difference is that spawn launches a fresh interpreter for each worker, so everything the child needs from the parent must be pickled and sent over, rather than being inherited via copy-on-write as with fork. That pickling and interpreter startup seems to be where the slowdown is coming from, but I can’t figure out how to speed it up. Do I need to add a __reduce__ method to any of these objects, or is this overhead unavoidable? I realize this slowdown is minuscule compared to the time spent actually loading real data and passing it through a model, but it’s really slowing down my unit tests and skewing other benchmarking I’m trying to do.
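For what it’s worth, the per-worker startup cost is visible with the standard library alone, independent of PyTorch. A minimal sketch comparing start methods (assuming a POSIX system where 'fork' and 'forkserver' are available; the function and timings here are illustrative, not from the original report):

```python
import multiprocessing as mp
import time


def worker(x):
    return x * 2


def time_start_method(method):
    # Build a context with the given start method and time a tiny pool job.
    ctx = mp.get_context(method)
    start = time.time()
    with ctx.Pool(2) as pool:
        result = pool.map(worker, range(5))
    return time.time() - start, result


if __name__ == '__main__':
    # 'fork' is unavailable on Windows; 'spawn' is the macOS default since 3.8.
    for method in ('fork', 'spawn', 'forkserver'):
        elapsed, result = time_start_method(method)
        print(f'{method}: {elapsed:.3f}s {result}')
```

On my understanding, 'spawn' should show the largest per-pool cost because each worker boots a fresh interpreter and re-imports the parent’s modules.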

I also encountered this problem. The slowest part seems to be worker startup/shutdown (add print(i) before continue to see that data retrieval itself is quite fast). I haven’t had time to investigate the root cause; my temporary workaround is to call mp.set_start_method('forkserver') before creating the DataLoader.
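As a variant of that workaround, DataLoader also accepts a multiprocessing_context argument, which scopes the start method to a single loader instead of setting it process-wide. A sketch using the MyDataset from the original script (forkserver is only available on POSIX systems):

```python
from torch.utils.data import DataLoader, Dataset


class MyDataset(Dataset):
    def __getitem__(self, i):
        return i

    def __len__(self):
        return 5


if __name__ == '__main__':
    ds = MyDataset()
    # Workers for this loader use forkserver instead of the macOS
    # default (spawn); other multiprocessing users are unaffected.
    dl = DataLoader(ds, num_workers=2, multiprocessing_context='forkserver')
    for batch in dl:
        print(batch)
```

The per-loader context avoids the pitfall of mp.set_start_method, which raises if the start method has already been set elsewhere in the process.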