DataLoader wrap-up time kills performance on both Intel and M1 Macs, but not on Colab

I found the solution in this past discussion:

In short:

DataLoader(
    ...,  # dataset, batch_size, num_workers, etc.
    multiprocessing_context="forkserver",
    persistent_workers=True,  # only takes effect with num_workers > 0
)

multiprocessing_context="forkserver" eliminates the 5s-per-worker hangup when exiting (tearing down) the DataLoader iteration. This is only needed on a Mac, not Linux, because the system level mp context differs on those platforms and the default PyTorch value uses the system value.

persistent_workers=True offers further gains, mostly unrelated to my original post above but relevant in multi-epoch runs: the worker processes are kept alive between epochs, so every epoch after the first skips the worker startup cost.
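
To make the combination concrete, here is a minimal self-contained sketch (the toy dataset, sizes, and timing loop are made up purely for illustration):

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(10_000, 32))  # toy data
    loader = DataLoader(
        dataset,
        batch_size=64,
        num_workers=2,                         # persistent_workers needs > 0
        multiprocessing_context="forkserver",  # avoids the teardown hang on macOS
        persistent_workers=True,               # reuse workers across epochs
    )
    for epoch in range(3):
        start = time.perf_counter()
        for (batch,) in loader:
            pass  # a real training step would go here
        # Epochs after the first skip worker startup, since the same
        # worker processes are reused.
        print(f"epoch {epoch}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":  # needed for spawn/forkserver start methods
    main()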

I realize the discussion of multiprocessing_context indicates that this problem occurs because Linux and macOS use different default multiprocessing start methods (fork on Linux and spawn on macOS, I believe, with forkserver as a third option), but I’m still curious why PyTorch doesn’t pick the “correct” value of multiprocessing_context on a per-platform basis. Why doesn’t it default to “forkserver” when running on a Mac?
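
In the meantime, the per-platform default I expected is easy enough to approximate by hand (a hypothetical sketch of my own, not anything PyTorch does internally):

import sys
import torch
from torch.utils.data import DataLoader, TensorDataset

# Pick forkserver on macOS; None falls back to the platform default elsewhere.
mp_context = "forkserver" if sys.platform == "darwin" else None

loader = DataLoader(
    TensorDataset(torch.arange(100)),  # toy dataset for illustration
    num_workers=2,
    multiprocessing_context=mp_context,
    persistent_workers=True,
)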

Likewise, I’m curious why the default value of persistent_workers is the less performant one. I presume there is a good reason to default it to False and make the user opt in, but I’m unclear on when I would ever want it to be False. Why is False the better default, rather than the other way around (default to the more performant value and require the user to override it if they see reason to)? I’m new to PyTorch, though, so I’m sure I’m just not understanding the parameter well enough to see the perfectly good explanation.

Cheers!