Torch distributed and dataset pickle error

farakiko · October 21, 2023, 6:32pm

Hi, my dataset is a tensorflow_dataset source, which unfortunately cannot be serialized (see image below). If I use “spawn” or “forkserver” as my torch distributed start method, I get the error shown in the image as soon as each process attempts to retrieve batches from the dataset.

If I use “fork”, the data loading works, but then I cannot send the tensors to CUDA (because CUDA does not support the “fork” start method).

Does anyone know how to run torch distributed with a dataset that cannot be serialized?
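For context, here is a minimal standard-library sketch of what seems to be going wrong, plus one common way around it. With “spawn” or “forkserver”, objects handed to the child process are pickled, so anything holding an unpicklable handle fails. `EagerSource`, `LazySource`, and `can_pickle` are hypothetical stand-ins (a `threading.Lock` plays the role of the unpicklable handle); they are not the real tensorflow_dataset objects, and whether the same trick applies to them is an assumption.

```python
import pickle
import threading


class EagerSource:
    """Hypothetical stand-in: holds an unpicklable handle from construction."""

    def __init__(self, n):
        self.n = n
        self._handle = threading.Lock()  # lock objects cannot be pickled

    def __iter__(self):
        return iter(range(self.n))


class LazySource:
    """Hypothetical stand-in: stores only picklable config, opens the handle on first use."""

    def __init__(self, n):
        self.n = n
        self._handle = None

    def _ensure_open(self):
        # Create the handle inside whichever process actually iterates.
        if self._handle is None:
            self._handle = threading.Lock()

    def __iter__(self):
        self._ensure_open()
        return iter(range(self.n))

    def __getstate__(self):
        # Drop the handle when pickling so the copy reopens it in the child.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False
```

In a torch setting, the equivalent would presumably be to keep only the arguments needed to build the source on the `Dataset` object and to build the source itself lazily inside `__iter__` or `__getitem__`, so that nothing unpicklable crosses the process boundary.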

H-Huang · October 23, 2023, 2:15pm

Replied in your other thread: Torch distributed and num_workers>0 pickle error - #6 by H-Huang