Data loader multiprocessing slow on macOS

I’ve noticed a significant slowdown when using num_workers > 0 on macOS with Python 3.8+. With the following benchmarking script:

import sys
import time

from torch.utils.data import DataLoader, Dataset


class MyDataset(Dataset):
    def __getitem__(self, i):
        return i

    def __len__(self):
        return 5


if __name__ == '__main__':
    ds = MyDataset()
    dl = DataLoader(ds, num_workers=int(sys.argv[1]))

    start = time.time()
    for i in dl:
        continue
    end = time.time()
    print(end - start)

I see these times on macOS:

$ python test.py 0
0.0023140907287597656
$ python test.py 1
9.695190906524658
$ python test.py 2
14.391160249710083
$ python test.py 3
18.87164807319641

I would expect these to take approximately the same amount of time. Obviously there is some overhead in creating parallel workers and handing work off to them, but the cost is far greater here than on Linux, where the same script gives:

$ python test.py 0
0.0008292198181152344
$ python test.py 1
0.025320768356323242
$ python test.py 2
0.03378176689147949
$ python test.py 3
0.030670166015625

The issue seems to be that starting with Python 3.8, the default multiprocessing start method on macOS changed from fork to spawn. The main difference is that with spawn, each child starts a fresh Python interpreter, so everything it needs from the parent (the dataset, collate function, and so on) has to be pickled and sent over rather than being inherited directly. It seems that this is where the slowdown is coming from, but I can’t figure out how to speed up the pickling. Do I need to add a __reduce__ method to one of these objects, or is this overhead unavoidable? I realize this slowdown is minuscule compared to the time spent actually loading real data and passing it through a model, but it’s really slowing down my unit tests and skewing other benchmarks I’m trying to run.
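For reference, you can confirm what the platform default is with a quick check like the one below (on macOS with Python 3.8+ it should print spawn, while on Linux it prints fork):

import multiprocessing as mp

if __name__ == '__main__':
    # Python 3.8 changed the macOS default from 'fork' to 'spawn';
    # Linux still defaults to 'fork'.
    print(mp.get_start_method())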


I also encountered this problem. The slowest part seems to be worker startup/shutdown (add a print(i) before the continue to see that the data retrieval itself is quite fast). I haven’t had time to investigate the root cause. My temporary workaround is to call mp.set_start_method('forkserver').
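Roughly, applied to the benchmark script above, the workaround looks like this (it assumes the MyDataset class from that script is in scope; torch.multiprocessing also works here, since it re-exports the standard module’s API):

import sys
import time
import multiprocessing as mp

from torch.utils.data import DataLoader

if __name__ == '__main__':
    # Must be called once, before any DataLoader workers are created.
    mp.set_start_method('forkserver')

    # MyDataset is the toy dataset defined in the benchmark script above.
    dl = DataLoader(MyDataset(), num_workers=int(sys.argv[1]))

    start = time.time()
    for i in dl:
        continue
    print(time.time() - start)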


I also noticed that DataLoader shutdown is very slow (between 5 and 10 seconds), even in a recent environment (MacBook Pro 14" with M1 Pro running PyTorch 2.0.0). As noted by @jia.kai, the issue is that PyTorch’s multiprocessing defaults to the spawn start method on macOS (on Linux the default is still fork, which explains the much better numbers above).

The answer from @jia.kai works fine; however, I would recommend using the dedicated DataLoader parameter instead:

from torch.utils.data import DataLoader

def main():
    dataset = ...

    dataloader = DataLoader(
        dataset,
        num_workers=4,
        multiprocessing_context="forkserver",
    )

if __name__ == "__main__":
    main()

If used in a training loop, I would also recommend passing persistent_workers=True to the DataLoader in order to avoid recreating the worker processes at the beginning of each iteration over the dataset.
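As a rough sketch of how the two options combine in a training loop (the number of workers and the epoch count are just placeholders):

from torch.utils.data import DataLoader

def train(dataset):
    dataloader = DataLoader(
        dataset,
        num_workers=4,
        multiprocessing_context="forkserver",
        persistent_workers=True,  # workers stay alive between epochs instead of being respawned
    )

    for epoch in range(10):
        for batch in dataloader:
            ...  # training step

if __name__ == "__main__":
    train(...)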

Also, instead of the Python multiprocessing library, I would recommend using torch.multiprocessing when it makes sense to do so.
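For example, torch.multiprocessing is a drop-in replacement for the standard module, so switching is usually just a change of import; the toy worker below is only illustrative:

import torch
import torch.multiprocessing as mp  # drop-in replacement for the stdlib multiprocessing module

def worker(t):
    # torch.multiprocessing registers custom reducers, so tensors passed to
    # subprocesses are placed in shared memory rather than copied byte-for-byte.
    print(t.sum())

if __name__ == "__main__":
    tensor = torch.ones(4)
    p = mp.Process(target=worker, args=(tensor,))
    p.start()
    p.join()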


Oh man! Thank you! Both of your suggestions made immediate improvements.

multiprocessing_context="forkserver" eliminated the huge delay (roughly 5 s per allocated worker) when wrapping up the final iteration over the DataLoader’s batches, and persistent_workers=True then removed some apparently unnecessary slowdowns in every iteration after the first one.

I wish this were better incorporated into the source. If PyTorch detects that it is running on a Mac, it should default to the more performant multiprocessing_context, unless there is a good reason for it to default to a value that seemingly renders PyTorch all but useless on a Mac. (I’m new to PyTorch and am willing to believe there is a justification for the default, but I still don’t see how to get any practical use out of PyTorch on a Mac without changing the mp context as described; perhaps I am misunderstanding something.) As for persistent_workers, if it is all gain at no obvious cost, I’m curious why the default is the slower option. Again, I suspect I’m missing something and that there is a good reason for the default to be what it is rather than the opposite value, but as a newcomer I don’t see it.

Thanks again!
