Hi everyone,
I came across a problem and cannot figure out which ordering is correct, or why:
# Option 1: wrap with DataParallel first, then create the optimizer
model = nn.DataParallel(model)
optimizer = optim.SGD(model.parameters(), lr=0.01)
## training code
# Option 2: create the optimizer first, then wrap with DataParallel
optimizer = optim.SGD(model.parameters(), lr=0.01)
model = nn.DataParallel(model)
## training code
I would guess the first one is logically correct, since PyTorch should optimize the parallelized model’s parameters. However, in my experiments both versions worked.
I wonder which one is the right way to set up the training and why.
Thanks a lot!!
Both should work, since the model is pushed to the default device and the optimization will also take place on this device.
This blogpost by @Thomas_Wolf explains it in a nice way.
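To see why the ordering doesn’t matter, note that nn.DataParallel stores the original model as its .module attribute and exposes the very same parameter tensors, so the optimizer receives identical objects either way. A minimal sketch (the toy nn.Linear model and lr value are illustrative, not from the thread):

```python
import torch.nn as nn
import torch.optim as optim

# Toy model standing in for the user's model.
model = nn.Linear(4, 2)
wrapped = nn.DataParallel(model)

# DataParallel keeps the original module as .module, so the wrapped
# model yields the *same* Parameter objects, not copies.
same = all(a is b for a, b in zip(model.parameters(), wrapped.parameters()))
print(same)  # True

# Hence creating the optimizer before or after wrapping hands it
# the same tensors to update.
optimizer = optim.SGD(wrapped.parameters(), lr=0.01)
```

Since both orderings register the same tensors with the optimizer, the in-place updates during optimizer.step() affect the wrapped and unwrapped model alike.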
Thanks for sharing! My understanding is that the parameter update is done after the gradients are reduced to GPU-1 (the GPU that scatters the replicas and inputs), and that the parameter update only takes place on GPU-1. Please correct me if I’m wrong.
Yes, that’s also my understanding, which is why I think that both approaches should work.