Torch distributed and num_workers>0 pickle error

I think the big picture question here is “how can I serialize only the data, and not the dataloader object?”.

I have put sample code here: https://github.com/farakiko/particleflow/blob/pyg_ddp_numworkers_fix/mlpf/ddp_error.py
and I have been debugging for some time without finding a solution :frowning:

The catch is

  1. it works on a single GPU when num_workers is None or >0
  2. it works on multiple GPUs only when num_workers is None (otherwise I get a pickle error)

The data_dir can be found here https://pfvol.nrp-nautilus.io/tensorflow_datasets_small/cms_pf_ttbar/1.6.0/

Any help would be appreciated!

Are your Dataset and collate_fn picklable? See the torch.utils.data page of the PyTorch 2.1 documentation.
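One quick way to answer that question is to try pickling each object yourself. Below is a minimal sketch using only the standard library. `ToyDataset`, `collate_fn` and `is_picklable` are hypothetical names made up for illustration, and the thread lock is just a stand-in for whatever unpicklable attribute a real dataset might hold (in real code the class would subclass `torch.utils.data.Dataset`).

```python
import pickle
import threading


def is_picklable(obj):
    """Return (True, None) if obj pickles, else (False, short error text)."""
    try:
        pickle.dumps(obj)
        return True, None
    except Exception as exc:  # pickle can raise PicklingError, TypeError, AttributeError
        return False, f"{type(exc).__name__}: {exc}"


class ToyDataset:
    """Hypothetical stand-in for a map-style dataset."""

    def __init__(self, with_lock):
        self.items = list(range(8))
        # A thread lock is a typical example of an unpicklable attribute.
        self.lock = threading.Lock() if with_lock else None

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


def collate_fn(batch):
    """Module-level functions pickle by reference; lambdas do not."""
    return list(batch)


if __name__ == "__main__":
    candidates = {
        "dataset without lock": ToyDataset(with_lock=False),
        "dataset with lock": ToyDataset(with_lock=True),
        "module-level collate_fn": collate_fn,
        "lambda collate_fn": lambda batch: list(batch),
    }
    for name, obj in candidates.items():
        ok, err = is_picklable(obj)
        print(f"{name}: {'ok' if ok else err}")
```

If the dataset fails this check, the error text usually names the offending attribute type, which tells you what to strip or replace.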

You could also try setting the multiprocessing start method to “fork”; see the multiprocessing (Process-based parallelism) page of the Python 3.12.0 documentation.

thanks, “fork” works for the dataloader part, but then I can’t move the tensors to CUDA because fork doesn’t work with CUDA

I assume that since the code runs fine with num_workers>0 on both CPU and a single GPU, my collate function is serializable. Is that not always the case?

ok, it seems that my data source cannot be serialized. Is there a way around that? And do you happen to know why the code works fine with torch distributed except when num_workers>0?

This is a mechanism of the dataloader rather than of any distributed code; see the multiprocessing section of the torch.utils.data page in the PyTorch 2.1 documentation, or try asking in the data category of the PyTorch Forums. My understanding is that when num_workers > 0, the dataset is pickled so that each worker process gets its own copy when the dataloader starts its workers. My guess is that this avoids instantiating the dataset separately in every process.

If your dataset cannot be pickled, you can try modifying it to make it “pickle-able”; see the pickle (Python object serialization) page of the Python 3.12.0 documentation.
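One common way to do that, sketched here under assumptions, is to define `__getstate__` and `__setstate__` so the unpicklable attribute is dropped before pickling and recreated afterwards. `LazyHandleDataset` and its `_handle` attribute are hypothetical, and the thread lock only stands in for a real unpicklable resource such as an open reader.

```python
import pickle
import threading


class LazyHandleDataset:
    """Hypothetical dataset that holds an unpicklable resource."""

    def __init__(self, values):
        self.values = list(values)
        self._handle = threading.Lock()  # stand-in for an unpicklable handle

    def __getstate__(self):
        # Copy the instance dict and drop the part pickle cannot handle.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state

    def __setstate__(self, state):
        # Restore the plain data, then recreate the handle in this process.
        self.__dict__.update(state)
        self._handle = threading.Lock()

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        with self._handle:
            return self.values[idx]


if __name__ == "__main__":
    original = LazyHandleDataset([10, 20, 30])
    restored = pickle.loads(pickle.dumps(original))
    print(len(restored), restored[1])
```

The original object keeps its handle; only the pickled copy is stripped, and each worker rebuilds its own handle when it unpickles the dataset.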

Thanks for the help! Will check those references :slight_smile:

thanks a lot for the help! your references helped a lot!!

I managed to solve it by not pickling the dataset_info class and instead pickling just a SimpleNamespace that holds the content the dataloader needs from dataset_info
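A minimal sketch of that idea, not the actual code from the repository: `DatasetInfo`, its `name`, `num_examples` and `reader` fields, `to_namespace`, and `NamespaceBackedDataset` are all hypothetical names, and the lambda is only there to make the stand-in object unpicklable.

```python
import pickle
from types import SimpleNamespace


class DatasetInfo:
    """Hypothetical stand-in for an unpicklable dataset_info object."""

    def __init__(self):
        self.name = "toy_dataset"
        self.num_examples = 100
        self.reader = lambda i: i  # unpicklable member (assumption for the demo)


def to_namespace(info):
    """Copy only the plain fields the dataloader needs."""
    return SimpleNamespace(name=info.name, num_examples=info.num_examples)


class NamespaceBackedDataset:
    """Hypothetical dataset that stores the lightweight copy, not info itself."""

    def __init__(self, info):
        self.info = to_namespace(info)

    def __len__(self):
        return self.info.num_examples

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        return {"dataset": self.info.name, "index": idx}


if __name__ == "__main__":
    dataset = NamespaceBackedDataset(DatasetInfo())
    restored = pickle.loads(pickle.dumps(dataset))
    print(len(restored), restored[3])
```

Since the dataset only references the SimpleNamespace, the worker processes never need to pickle the original info object.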

will leave it here in case it helps anyone!