Order of DataParallel and optimizer

Hi everyone,
I came across a problem and cannot figure out which ordering is right, or why:

# Option 1: wrap the model first, then create the optimizer
model = nn.DataParallel(model)
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr value is just a placeholder
## training code

# Option 2: create the optimizer first, then wrap the model
optimizer = optim.SGD(model.parameters(), lr=0.01)
model = nn.DataParallel(model)
## training code

I would guess the first one is logically right, since PyTorch is going to optimize the parallelized model's parameters. However, in my experiments both orderings worked.
I wonder which one is the right way to set up training, and why.

Thanks a lot!!

Both should work, since the model is pushed to the default device and the optimization will also take place on this device.
This blog post by @Thomas_Wolf explains it nicely.
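
The key point is that nn.DataParallel keeps the original module (and its parameters) on the source device; the replicas only exist temporarily during the forward pass. Here is a minimal sketch to see this, assuming at least one CUDA device and a toy nn.Linear model chosen just for illustration:

import torch
import torch.nn as nn

# Toy model purely for illustration; assumes at least one GPU is available.
model = nn.Linear(10, 2).cuda()
params_before = list(model.parameters())

model = nn.DataParallel(model)
params_after = list(model.parameters())

# The wrapper still exposes the original parameter tensors, so an optimizer
# created before or after the wrap holds references to the same objects.
print(all(a is b for a, b in zip(params_before, params_after)))  # True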

Thanks for sharing! My understanding is that the parameter update happens after the gradients are reduced onto GPU-1 (the GPU that scatters the replicas and inputs), and that the update only takes place on GPU-1. Please correct me if I'm wrong.
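
To make sure I'm describing it correctly, here is a minimal sketch of the flow I have in mind (the model, loss function, and tensor shapes are made up just for this example, and it assumes the inputs live on the first GPU):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.DataParallel(nn.Linear(10, 2).cuda())
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

data = torch.randn(64, 10, device='cuda:0')
target = torch.randn(64, 2, device='cuda:0')

optimizer.zero_grad()
output = model(data)           # scatter inputs, replicate model, gather outputs
loss = criterion(output, target)
loss.backward()                # gradients are reduced onto the source GPU's parameters
optimizer.step()               # the update runs only on those source-GPU parameters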

Yes, that’s also my understanding, which is why I think that both approaches should work.
