Will `backward()` consider all the models across various GPUs with `nn.DataParallel`?

I have a model like the following

class MyModel(nn.Module):
    ...
    def forward(self, x):
        ...
        return y

    def compute_loss(self, y, t):
        ...
        return loss

    def update(self, loss):
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()

I am trying to run this model on two GPUs using DataParallel, and my main() looks something like this

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(MyModel()).to(DEVICE)
...
for xb, yb in train_dl:
    y = model(xb) # Data being fed to two GPUs, I'm happy :)
    loss = model.module.compute_loss(y, yb)  # This is the issue
    model.module.update(loss) # This too!

I am wondering: will this take both versions of the model, living on the two different GPUs, into account during the weight update?

My code is running on both GPUs (checked with nvidia-smi), but I suspect that even though both GPUs are used during forward(), compute_loss and update only process the loss/gradients for the model living on cuda:0, because I’m calling compute_loss and update via module.

Or it could be that the loss is linked to the gradients on both GPUs through the computation graph, since model(xb) returned the combined y from both GPUs, and because of that loss.backward() would run through both of them.

Could you please shed some light on this?
Thank you so much! :grin:

The loss will be computed on the default device, but the backward pass should be executed on both devices again, as the loss calculation should still be attached to the computation graph created in the forward pass on both devices.
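
In other words, for the snippet above (the comments just describe the expected behavior; model, xb, and yb are the names from your code):

y = model(xb)                            # input scattered, replicas run on both GPUs,
                                         # outputs gathered back onto cuda:0
loss = model.module.compute_loss(y, yb)  # loss tensor lives on cuda:0, but y still
                                         # carries the autograd graph of both replicas
loss.backward()                          # backward runs through both replicas and the
                                         # replica gradients are reduced onto the
                                         # parameters on cuda:0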

@ptrblck Thank you for your comment!

Meanwhile, I was able to find another very useful thread that shares a beautiful illustration of how DataParallel actually works in the background.

I think the code I shared above might not work properly, because I declared the optimizer as part of the model. Since I’m calling the optimizer through module, it will only update the weights of the model on the default GPU. Even if the error manages to flow back to both models through the linked computation graph, I don’t see any routine that merges the gradients into one and makes a single global update.

However, if I instantiate the optimizer independently in main() and call it on model.parameters(), steps 5 and 6 (from the illustration) should run as intended, as in the sketch below.
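
Concretely, something like this is what I have in mind (just a sketch reusing MyModel and train_dl from above; the SGD optimizer and the MSE criterion are placeholders for whatever is actually used):

import torch
import torch.nn as nn

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(MyModel()).to(DEVICE)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

for xb, yb in train_dl:
    xb, yb = xb.to(DEVICE), yb.to(DEVICE)
    optimizer.zero_grad()
    y = model(xb)            # forward scattered across both GPUs
    loss = criterion(y, yb)  # loss computed on cuda:0 from the gathered output
    loss.backward()          # replica gradients reduced onto the cuda:0 parameters
    optimizer.step()         # one update of the single set of parameters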

@rasbt Thank you so much for the illustration. Do you think my above remark is correct?

I don’t think the description is completely correct for nn.DataParallel, as you would be handling a single model only and the gradients should already be reduced to the default model on the default device.
However, we generally don’t recommend nn.DataParallel as it’s slower than DistributedDataParallel, so I also don’t know how the “internal” optimizer might behave.
In any case, since you have concerns, I would recommend running a quick test using defined data to check the behavior. :wink:

I could not really see which specific command would cause this reduction of gradients, unless there is some mechanism inside the optimizer that checks whether multiple GPUs are involved and reduces the gradients automatically.

But I would like to do the quick test you mentioned. What do you mean by defined data? And how do I run such a test? I would really appreciate it if you could point me somewhere with a bit more explanation.

Apologies for the noob questions. :frowning_face:

This blog post, as well as the post you’ve linked in your previous message, shows where the gradients are reduced to the default device.
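
For a quick check with defined data, something along these lines should work (a minimal sketch; the tiny linear model, shapes, and names are made up for illustration):

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda:0")

# the same weights in a plain model and in a DataParallel-wrapped copy
single = nn.Linear(4, 2).to(device)
wrapped = nn.DataParallel(copy.deepcopy(single)).to(device)

# a fixed ("defined") batch fed to both models
x = torch.randn(8, 4, device=device)
t = torch.randn(8, 2, device=device)
criterion = nn.MSELoss()

criterion(single(x), t).backward()
criterion(wrapped(x), t).backward()

# the gradients of the DataParallel copy end up on wrapped.module (cuda:0)
# and should match the plain single-device run
for p, q in zip(single.parameters(), wrapped.module.parameters()):
    print(torch.allclose(p.grad, q.grad, atol=1e-6))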
