I have a model like the following:

```python
def forward(self, x): ...
def compute_loss(self, y, t): ...
def update(self, loss): ...
```

I am trying to run this model on two GPUs using `DataParallel`, and my `main()` looks something like this:
```python
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(MyModel()).to(DEVICE)

for xb, yb in train_dl:
    y = model(xb)                            # data being fed to two GPUs, I'm happy :)
    loss = model.module.compute_loss(y, yb)  # this is the issue
    model.module.update(loss)                # this too!
```
I am wondering: will this consider the two different versions of the model living on the two different GPUs during the weight update?
My code runs on both GPUs (checked with `nvidia-smi`), but I suspect that even though both GPUs are used during `forward()`, `compute_loss()` and `update()` only process the loss/gradients for the model living on `cuda:0`, because I'm calling them through `model.module`.
Or it could be that the loss is linked to both GPUs' gradients through the computation graph, since `model(xb)` returned the combined `y` from both GPUs, and because of that `loss.backward()` would run through both of them.
Could you please shed some light on my doubt?
Thank you so much!
The loss will be computed on the default device, but the backward pass should still be executed on both devices, as the loss calculation should remain attached to the computation graph created in the forward pass on both devices.
@ptrblck Thank you for your comment!
Meanwhile I was able to find another very useful thread where a beautiful illustration of how
DataParallel actually works in the background is shared.
I think the code I shared above might not work properly, because I declared the optimizer as part of the model. And since I'm calling the optimizer through `module`, it will only update the default GPU's model weights. Even if the error manages to flow back to both models through the linked computation graph, I do not see any routine that would merge the gradients into one and make a single global update.
However, if I instantiate the optimizer independently and call it on `model.parameters()` from `main()`, steps 5 and 6 (from the illustration) should run as intended.
@rasbt Thank you so much for the illustration. Do you think my above remark is correct?
I don’t think the description is completely correct for `nn.DataParallel`, as you would be handling a single model only, and the gradients should already be reduced to the default model on the default device.
However, we generally don’t recommend `nn.DataParallel`, as it’s slower than `DistributedDataParallel`, so I also don’t know how the “internal” optimizer might behave.
In any case, since you have concerns, I would recommend running a quick test using defined data to check the behavior.
I could not really see which specific command would cause this reduction of gradients, unless there is some mechanism inside the optimizer that checks whether multiple GPUs are involved and reduces the gradients automatically.
But I would like to do the quick test you mentioned. What do you mean by “defined data”? And how would I do such a test? I would really appreciate it if you could point me somewhere with a bit more explanation.
Apologies for the noob questions.
This blog post, as well as the post you’ve linked in your previous reply, show where the gradients are reduced to the default device.
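As a minimal sketch of the kind of quick test I had in mind (the tiny linear model and fixed inputs are just made-up placeholders; on a CPU or single-GPU machine the `DataParallel` branch is skipped, so the comparison is only meaningful with at least two GPUs):

```python
import torch
import torch.nn as nn

# A tiny model with known parameters, so the expected gradient is easy to verify.
base = nn.Linear(3, 1, bias=False)
with torch.no_grad():
    base.weight.fill_(1.0)

# Reference: plain single-device run on fixed ("defined") data.
x = torch.arange(12, dtype=torch.float32).reshape(4, 3)
ref = base(x).sum()
ref.backward()
ref_grad = base.weight.grad.clone()

# Same model, wrapped in DataParallel when multiple GPUs are available.
wrapped = nn.Linear(3, 1, bias=False)
with torch.no_grad():
    wrapped.weight.fill_(1.0)
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped.cuda())
    x = x.cuda()
out = wrapped(x).sum()
out.backward()

# The gradient reduced onto the default device should match the reference.
inner = wrapped.module if isinstance(wrapped, nn.DataParallel) else wrapped
print(torch.allclose(inner.weight.grad.cpu(), ref_grad))  # prints True
```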