Sharing a dataset between subprocesses

Suppose I am using a fairly big Dataset that I only want to load once. Using multiprocessing.Pool I spawn several processes, each of which trains a different model with different hyperparameters, so this is not a distributed training problem but rather a parallelization across different models. What I want is for each subprocess to have access to the dataset that was loaded once, and to build its own DataLoader on top of it to receive batches.
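To make this concrete, here is a rough sketch of what I am doing at the moment (placeholder data and a made-up `train_one_config`; only the learning rate is varied, for brevity):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from multiprocessing import Pool

def train_one_config(args):
    dataset, lr = args
    # each subprocess builds its own DataLoader over the dataset it received
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    for inputs, targets in loader:
        pass  # model creation and training with this lr would go here
    return lr

if __name__ == "__main__":
    # the big dataset is loaded once in the parent process
    features = torch.randn(100_000, 32)       # stand-in for the real data
    labels = torch.randint(0, 2, (100_000,))
    dataset = TensorDataset(features, labels)

    lrs = [1e-2, 1e-3, 1e-4]
    with Pool(processes=len(lrs)) as pool:
        # handing the dataset to every worker is the part I would like
        # to avoid paying for more than once
        results = pool.map(train_one_config, [(dataset, lr) for lr in lrs])
```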

What is the best way to do this in PyTorch? I can’t figure out a way to make the Dataset accessible to all subprocesses, other than recreating it in every subprocess, which is costly.
Thanks!

A while ago I created this example using shared arrays for multiple workers of a DataLoader, which might be useful for you. :wink:
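The gist of it, roughly (simplified from memory, so the names are not the ones from that post): the data lives in a multiprocessing shared array and the Dataset only wraps a view of that buffer, so every loader worker reads the same memory (the fork start method is assumed here):

```python
import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SharedArrayDataset(Dataset):
    """Dataset whose storage lives in a multiprocessing shared array."""

    def __init__(self, shared_base, shape):
        # wrap the shared buffer as a numpy array without copying it
        self.data = np.ctypeslib.as_array(shared_base).reshape(shape)

    def __getitem__(self, index):
        # torch.from_numpy keeps pointing at the shared buffer (no copy)
        return torch.from_numpy(self.data[index])

    def __len__(self):
        return self.data.shape[0]

if __name__ == "__main__":
    shape = (1000, 3, 24, 24)
    shared_base = mp.Array(ctypes.c_float, int(np.prod(shape)), lock=False)
    dataset = SharedArrayDataset(shared_base, shape)

    loader = DataLoader(dataset, batch_size=32, num_workers=2)
    for batch in loader:
        pass  # all loader workers index into the same shared memory
```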

Thanks for the fast reply, but I am not sure how this helps me. I do not want to share a dataset between multiple workers of a single DataLoader. What I need is to share a Dataset between subprocesses, where each subprocess has its own DataLoader, independent of the others.

Could I somehow make the Dataset accessible in shared memory and then build my DataLoader inside the subprocess routine?
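Something along these lines is what I am picturing; a sketch under the assumption that tensors moved into shared memory with share_memory_() are shared rather than copied when handed to torch.multiprocessing processes, again with placeholder data and a made-up `train_one_config`:

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import TensorDataset, DataLoader

def train_one_config(dataset, lr):
    # the DataLoader is built inside the subprocess, on top of the shared tensors
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    for inputs, targets in loader:
        pass  # training with this lr would go here

if __name__ == "__main__":
    # load once in the parent and move the underlying storage to shared memory
    features = torch.randn(100_000, 32)        # stand-in for the real data
    labels = torch.randint(0, 2, (100_000,))
    features.share_memory_()
    labels.share_memory_()
    dataset = TensorDataset(features, labels)

    processes = []
    for lr in [1e-2, 1e-3, 1e-4]:
        p = mp.Process(target=train_one_config, args=(dataset, lr))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```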