Training independent networks in parallel with reproducibility

I would like to train, say, 10 independent neural networks on 5 GPUs in parallel (two per GPU, assuming there are no memory constraints). I would also like the code to be reproducible, so I have been training each network by re-running the same Python script, changing only the GPU device, and setting torch.manual_seed for reproducibility.
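
Roughly, each run's setup looks like the sketch below (the `--gpu` and `--seed` arguments are just illustrative names, not my actual ones):

```python
# Minimal sketch of the per-run setup: each invocation of the script
# gets its own seed and GPU via command-line arguments.
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--gpu", type=int, default=0)
parser.add_argument("--seed", type=int, default=0)
args = parser.parse_args()

torch.manual_seed(args.seed)              # seed CPU and CUDA RNGs for this run
device = torch.device(f"cuda:{args.gpu}") # train this network on the given GPU

# ... build the model/optimizer and train on `device` as usual ...
```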

Is there any way to do this in a single Python script where PyTorch does the work of distributing the networks to the GPUs while maintaining reproducibility? I am aware of threading as one way to do this (something similar to ModelParallel), but I am worried about the reproducibility of my code when using threading.

Thank you for the help!

Hi,

Even with the same code, we don’t guarantee reproducibility across different hardware or software versions. See the documentation about this here.

If you want to ensure that the runs are independent, I would recommend using a simple batch script that launches all the jobs :slight_smile:
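
For example, a minimal launcher sketch, written here in Python for concreteness (the script name `train.py` and its `--gpu`/`--seed` flags are placeholder names for however you invoke the per-network script):

```python
# Hypothetical launcher: starts 10 independent training processes,
# two per GPU across 5 GPUs, each with its own seed.
import subprocess

NUM_NETS = 10
NUM_GPUS = 5

procs = []
for i in range(NUM_NETS):
    gpu = i % NUM_GPUS  # round-robin assignment: two networks per GPU
    cmd = ["python", "train.py", "--gpu", str(gpu), "--seed", str(i)]
    procs.append(subprocess.Popen(cmd))  # launch as a fully separate process

for p in procs:
    p.wait()  # block until every run has finished
```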

Thanks for the reply! I have seen the documentation on reproducibility.

In this case, to begin with, I am only looking for reproducibility on my particular set-up (hardware and software). I am indeed using a shell script to launch the jobs, but I thought it would be neater if I could do it all within one script, with PyTorch responsible for distributing the networks across the GPUs.

The level of “independence” between the runs would be much smaller if you run the whole thing in a single process. They will share the same Python interpreter, the same CUDA allocator, and the same memory space.

The ModelParallel tool that you linked seems interesting, but I don’t know of any other work along those lines…
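
If you really want a single entry point, one compromise is to spawn each run as its own process from that one script, e.g. with torch.multiprocessing, so the runs still get their own interpreter, CUDA allocator, and memory space. This is only a sketch, with `train_one` standing in for your actual training code:

```python
# Sketch only: one entry point, but each network trains in its own process.
import torch
import torch.multiprocessing as mp

NUM_NETS = 10
NUM_GPUS = 5

def train_one(idx):
    torch.manual_seed(idx)                        # per-run seed
    device = torch.device(f"cuda:{idx % NUM_GPUS}")  # two runs per GPU
    # ... build the model and train it on `device` ...

if __name__ == "__main__":
    # spawn() calls train_one(i) for i = 0..NUM_NETS-1, each in a separate process
    mp.spawn(train_one, nprocs=NUM_NETS, join=True)
```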