When using torch.nn.DataParallel
for multi-GPU training, does loss.backward()
only compute the gradients for the model replica on the master GPU, or does it compute the gradients on each GPU and then merge them?
The latter.
In particular, the intermediate results needed for the backward pass typically live on the GPU where the corresponding forward pass ran.
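To make this concrete, here is a minimal sketch of that behavior (assuming a machine with at least two GPUs; the layer and batch size are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda(0)        # parameters live on GPU 0 (the "master")
dp_model = nn.DataParallel(model)       # replicas are created on each forward call

x = torch.randn(8, 10, device="cuda:0")  # the batch is scattered across the GPUs
y = dp_model(x)                          # forward runs on each replica's GPU
loss = y.sum()

loss.backward()  # backward runs per replica, on the GPU that holds its
                 # intermediate activations; the per-replica gradients are
                 # then reduced onto the parameters on GPU 0

print(model.weight.grad.device)  # cuda:0 -- the merged gradient on the master GPU
```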
Best regards
Thomas
Got it, thanks for your reply! :)