Torch distributed and num_workers>0 pickle error

This is more a mechanism of the DataLoader than of any distributed code; see the multi-process data loading section of torch.utils.data — PyTorch 2.1 documentation, or you can try asking for help in the data category of the PyTorch Forums. My understanding is that when num_workers > 0 the DataLoader pickles the dataset so it can hand a copy to each worker process it spawns; my guess is that this is to prevent the dataset from being instantiated multiple times across processes.
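Here is a minimal sketch (with a made-up dataset class) of how this surfaces: an unpicklable attribute such as a lambda works fine with num_workers=0, but fails once workers are spawned with a start method that pickles the dataset (e.g. "spawn", which is common in distributed setups):

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader

class LambdaDataset(Dataset):
    """Hypothetical dataset holding a lambda, which the pickle module cannot serialize."""
    def __init__(self):
        self.transform = lambda x: x * 2  # not pickle-able
        self.data = torch.arange(10, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.transform(self.data[idx])

if __name__ == "__main__":
    mp.set_start_method("spawn")  # "fork" (Linux default) would not pickle the dataset
    ds = LambdaDataset()

    # num_workers=0: everything runs in the main process, nothing is pickled -> OK
    for batch in DataLoader(ds, batch_size=4, num_workers=0):
        pass

    # num_workers=2 with "spawn": the dataset is pickled and sent to each worker,
    # which raises an error like "Can't pickle <function <lambda> ...>"
    for batch in DataLoader(ds, batch_size=4, num_workers=2):
        pass
```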

If your dataset cannot be pickled, you can try modifying it to make it pickle-able; see pickle — Python object serialization — Python 3.12.0 documentation.
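One common way to do that (again just a sketch, with hypothetical names) is to keep the unpicklable resource out of the pickled state via `__getstate__`/`__setstate__` and recreate it lazily inside each worker:

```python
import torch
from torch.utils.data import Dataset

class FileBackedDataset(Dataset):
    """Hypothetical dataset backed by a file handle, which cannot be pickled directly."""
    def __init__(self, path):
        self.path = path
        self._handle = None  # opened lazily, never pickled

    def _ensure_open(self):
        # Reopen the file in whichever process is reading from it
        if self._handle is None:
            self._handle = open(self.path, "rb")
        return self._handle

    def __getstate__(self):
        # Drop the open file handle before pickling; keep only picklable state
        state = self.__dict__.copy()
        state["_handle"] = None
        return state

    def __setstate__(self, state):
        # Called in the worker process; the handle is reopened on first use
        self.__dict__.update(state)

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        f = self._ensure_open()
        # ... read and decode the idx-th record from f ...
        return torch.zeros(1)
```

The same pattern works for other unpicklable members (database connections, HDF5 handles, etc.): store only what is needed to recreate them, and build the live object inside the worker.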