Results reproducibility using torch.multiprocessing

I’m trying to make my code reproducible using the same parameters and, in fact, it works fine when I do not use torch.multiprocessing.

So, in both codes I have the following seed and cuDNN settings:

import random

import numpy as np
import torch

# Set seed for deterministic results
torch.manual_seed(12345)
torch.cuda.manual_seed(12345)
np.random.seed(12345)
random.seed(12345)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Also, I’m passing the following function as the worker_init_fn argument of torch.utils.data.DataLoader:

# function to set dataloader worker seed
def _init_fn(worker_id):
    np.random.seed(12345 + worker_id)

The DataLoader num_workers argument is set to 1.
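
For completeness, a minimal sketch of how these pieces plug into the DataLoader (the TensorDataset and batch size are stand-ins for my actual data):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# function to set dataloader worker seed
def _init_fn(worker_id):
    np.random.seed(12345 + worker_id)

# toy dataset standing in for the real one
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    num_workers=1,
    worker_init_fn=_init_fn,
)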

While I get deterministic and reproducible results running the code in a single process, training the network using torch.multiprocessing does not give me the same deterministic reproducibility.

The only difference between the two codes is that when I use torch.multiprocessing, I set all the seeds and create the dataloader in the parent process, and create the model and train in the child process.

So the question is: am I missing something to make my results reproducible with torch.multiprocessing? Any insight is really appreciated.

By the way, I have just one GPU and I am spawning just one process, so in the torch.multiprocessing code the parent is just a manager and training happens in the child. I have multiple models, so each new model training happens in a new child process after the previous child finishes. A sketch of this structure follows.
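
In sketch form, the structure looks roughly like this (the Linear model, optimizer, and loop counts are placeholders; the worker seeding function from above is omitted for brevity, and passing the loader assumes it and its dataset are picklable so they can be sent to the spawned child):

import random

import numpy as np
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def train(rank, loader):
    # child process: build the model and run one training job
    model = torch.nn.Linear(3, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    # parent process: set all the seeds and cuDNN flags, create the dataloader
    torch.manual_seed(12345)
    torch.cuda.manual_seed(12345)
    np.random.seed(12345)
    random.seed(12345)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
    loader = DataLoader(dataset, batch_size=10, shuffle=True)

    # one model per child process, spawned after the previous child finishes
    for _ in range(3):
        mp.spawn(train, args=(loader,), nprocs=1, join=True)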


Okay, formulating this topic to post it helped me think through the problem. It turns out my assumption was wrong: just because I run the model trainings sequentially does not mean the RNG state passed to each child will always be the same.

If I reseed and set the cuDNN flags again in the child, then I get deterministic and reproducible code.

So, for future reference for others with the same problem: to solve it I had to reseed and set the cuDNN flags in the child too, just like in the parent:

# Set seed for deterministic results
torch.manual_seed(12345)
torch.cuda.manual_seed(12345)
np.random.seed(12345)
random.seed(12345)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Just reseeding was not enough; the cuDNN flags needed to be set as well.
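
Factored out, the fix looks roughly like this (set_seed is just an illustrative name for the block above, and the print is only there to show that every child now starts from the same RNG state):

import random

import numpy as np
import torch
import torch.multiprocessing as mp

def set_seed(seed=12345):
    # must run in every process that trains, not only the parent
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def train(rank):
    set_seed()  # reseed and set the cuDNN flags inside the child
    print(f"child first random draw: {torch.rand(1).item():.6f}")

if __name__ == "__main__":
    set_seed()  # parent
    for _ in range(3):
        # every child prints the same value once the reseed is in place
        mp.spawn(train, nprocs=1, join=True)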

Now, while this works, I do not understand why I had to reset everything in the child if it was supposed to inherit all the RNG states from the parent.
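
For anyone digging into that question, a quick probe (assuming the child is started via mp.spawn, i.e. the 'spawn' start method) suggests the child begins with a fresh interpreter and fresh defaults rather than inheriting the parent's runtime state:

import torch
import torch.multiprocessing as mp

def check(rank):
    # under 'spawn' the child re-imports torch from scratch,
    # so flags set at runtime in the parent do not carry over
    print("child cudnn.deterministic:", torch.backends.cudnn.deterministic)

if __name__ == "__main__":
    torch.backends.cudnn.deterministic = True
    mp.spawn(check, nprocs=1, join=True)  # prints False here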
