Torch distributed and num_workers>0 pickle error

farakiko · October 20, 2023, 3:18pm

I think the big picture question here is “how can I serialize only the data, and not the dataloader object?”.

I have put sample code here: https://github.com/farakiko/particleflow/blob/pyg_ddp_numworkers_fix/mlpf/ddp_error.py
I have been debugging it for some time without finding a solution.

The catch is:

It works on a single GPU when num_workers is None or >0.
It works on multiple GPUs only when num_workers is None (otherwise I get the following error).

[Screenshot of the error traceback]

The data_dir can be found here https://pfvol.nrp-nautilus.io/tensorflow_datasets_small/cms_pf_ttbar/1.6.0/

Any help would be appreciated!

H-Huang · October 20, 2023, 3:52pm

Are your Dataset and collate_fn picklable? See the torch.utils.data page of the PyTorch 2.1 documentation.

You could also try setting the multiprocessing start method to “fork”; see the multiprocessing page (Process-based parallelism) of the Python 3.12.0 documentation.
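One way to check picklability is to try pickling each attribute of the object in turn. The snippet below is a minimal sketch using only the standard library; `find_unpicklable_attrs` and `ToyDataset` are hypothetical names made up for illustration, and a `threading.Lock` stands in for whatever unpicklable resource a real dataset might hold.

```python
import pickle
import threading


def find_unpicklable_attrs(obj):
    """Return {attribute name: exception type name} for attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickling can raise several exception types
            failures[name] = type(exc).__name__
    return failures


class ToyDataset:
    """Hypothetical dataset with one picklable and one unpicklable attribute."""

    def __init__(self):
        self.samples = [1, 2, 3]
        self.handle = threading.Lock()  # stand-in for an unpicklable resource


report = find_unpicklable_attrs(ToyDataset())
print(report)  # lists only the attributes that could not be pickled
```

The same helper can be pointed at a real dataset instance; calling `pickle.dumps(collate_fn)` directly is usually enough to check the collate function.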

farakiko · October 20, 2023, 3:54pm

Thanks, “fork” works for the dataloader part, but then I can’t move the tensors to CUDA because fork doesn’t work with CUDA.

farakiko · October 21, 2023, 5:29pm

I assume that since the code runs fine with num_workers>0 on both CPU and a single GPU, my collate function is serializable. Is that not always the case?

farakiko · October 21, 2023, 6:34pm

OK, it seems that my data source cannot be serialized. Is there a way around that? And do you happen to know why the code works fine with torch distributed except when num_workers>0?

H-Huang · October 23, 2023, 2:14pm

This is more a mechanism of the dataloader than of any distributed code. See the multiprocessing section of the torch.utils.data page in the PyTorch 2.1 documentation, or try asking for help in the data category of the PyTorch Forums. My understanding is that when num_workers > 0, the dataset is pickled so that it can be passed to each worker process when the workers are started; my guess is that this avoids instantiating the dataset separately in every process.

If your dataset cannot be pickled, you can try modifying it to make it “pickle-able”; see the pickle page (Python object serialization) of the Python 3.12.0 documentation.
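As a rough illustration of making a dataset pickle-able, one common pattern is to drop the unpicklable member in `__getstate__` and rebuild it lazily on first use in whichever process ends up holding the copy. This is only a sketch under assumptions: `LazySourceDataset` and its `_open_source` method are hypothetical, and the lock is a placeholder for a real unpicklable handle such as a file reader.

```python
import pickle
import threading


class LazySourceDataset:
    """Dataset-like class whose unpicklable handle is rebuilt lazily."""

    def __init__(self, values):
        self.values = list(values)
        self._source = None  # opened on first access, never pickled

    def _open_source(self):
        # Placeholder for opening an unpicklable resource (file handle, reader, ...).
        return {"lock": threading.Lock(), "values": self.values}

    @property
    def source(self):
        if self._source is None:
            self._source = self._open_source()
        return self._source

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        with self.source["lock"]:
            return self.source["values"][idx]

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_source"] = None  # drop the handle so pickling succeeds
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)


ds = LazySourceDataset([10, 20, 30])
first = ds[0]  # opens the handle in this process
clone = pickle.loads(pickle.dumps(ds))  # round trip, similar to handing the dataset to a worker
second = clone[1]  # the copy reopens its own handle on demand
```

The class deliberately avoids importing torch so it runs anywhere; the same two methods can be added to a `torch.utils.data.Dataset` subclass.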

farakiko · October 23, 2023, 2:25pm

Thanks for the help! Will check those references

farakiko · October 24, 2023, 3:13pm

Thanks a lot for the help! Your references helped a lot!

I managed to solve it by defining this part so that the dataset_info class is not pickled; instead, only a SimpleNamespace holding the content the dataloader needs from dataset_info gets pickled.

I'll leave it here in case it helps anyone!
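For readers without access to the snippet, the idea can be sketched roughly as follows. This is not the code from the post: `FakeDatasetInfo`, its fields, and `to_picklable_info` are hypothetical stand-ins, and the lock simply makes the object unpicklable for the purpose of the demonstration.

```python
import pickle
import threading
from types import SimpleNamespace


class FakeDatasetInfo:
    """Hypothetical stand-in for an unpicklable dataset_info object."""

    def __init__(self):
        self.name = "cms_pf_ttbar"
        self.num_examples = 100  # made-up value for illustration
        self._lock = threading.Lock()  # makes the whole object unpicklable


def to_picklable_info(info):
    """Copy only the plain fields the dataloader needs into a SimpleNamespace."""
    return SimpleNamespace(name=info.name, num_examples=info.num_examples)


info = FakeDatasetInfo()

try:
    pickle.dumps(info)
    original_picklable = True
except TypeError:
    original_picklable = False  # the lock cannot be pickled

light_info = to_picklable_info(info)
restored = pickle.loads(pickle.dumps(light_info))  # succeeds: plain data only
print(original_picklable, restored.name, restored.num_examples)
```

The dataset then stores `light_info` instead of the original object, so only plain data crosses the process boundary.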