I am trying to do multi-GPU training with DistributedDataParallel. I wrap it around my model. However, my model has a custom function that I now call via `model.module.function(x)`. I was wondering if this is OK or if something bad will happen. Thanks!
What does this custom function do, and when do you call it? If it does not modify parameters or the autograd graph built during the forward pass, it should be OK.
The pseudocode is something like this:
```python
output = model(input)
output2 = model(input2)
final_output = model.module.function(output, output2)
loss = loss_function(final_output)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
Would this be fine? The custom function is just an MLP to classify something. It does not change anything, but I want it to get updated when I call `optimizer.step()`.
If `model` is a DistributedDataParallel (DDP) instance, this won't work, because DDP sets up some internal states at the end of the forward pass and does not work if you call `forward` twice without a `backward` in between.
However, this can be easily solved by wrapping the two forwards and the function invocation into a wrapper model, and then passing that wrapper model to DDP, sth like:

```python
class WrapperModel(nn.Module):
    def __init__(self, model):
        super(WrapperModel, self).__init__()
        self.model = model

    def forward(self, input, input2):
        output = self.model(input)
        output2 = self.model(input2)
        # call the custom function on the wrapped model directly;
        # no .module here, since self.model is not the DDP instance
        final_output = self.model.function(output, output2)
        return final_output


ddp = DistributedDataParallel(WrapperModel(model).to(device), device_ids=[device])

final_output = ddp(input, input2)
loss = loss_function(final_output)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
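With this wrapper, every use of the underlying model's parameters, including the custom function, happens inside a single DDP forward pass, so DDP's gradient-synchronization hooks cover all of them.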
I set `broadcast_buffers=False`, so I didn't have an issue calling forward twice. In that case, is it fine if I call my custom function the way I did, and will the gradients be correct?
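For reference, the construction I'm describing is roughly this (a sketch; `device` is just whichever GPU this rank uses):

```python
from torch.nn.parallel import DistributedDataParallel

# broadcast_buffers=False stops DDP from re-broadcasting buffers
# (e.g. BatchNorm running stats) at every forward pass, which is
# what allowed calling forward twice per iteration here
model = DistributedDataParallel(
    model.to(device),
    device_ids=[device],
    broadcast_buffers=False,
)
```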
If `model.module.function` is not using the parameters in the model, it should work.
A little more detail on my method. The pseudocode is:
```python
class model(nn.Module):
    def __init__(self):
        super(model, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.mlp = MLP()

    def encode(self, x):
        return self.encoder(x)

    def decode(self, x):
        return self.decoder(x)

    def classify(self, a, b):
        return self.mlp(a, b)

    def forward(self, x):
        enc = self.encode(x)
        out = self.decode(enc)
        return enc, out


# this is my main training script
enc, out = model(x)
enc2 = enc + d  # d is some random perturbations
out2 = model.module.decode(enc2)
pred = model.module.classify(enc, enc2)
```
There is a bunch of other stuff, but in this scenario my `decode` function is using the parameters in the model. Would this be an issue? There are no errors when running.
How do you compute the final loss (the one that `backward` is launched from)? I assume both `out` and `out2` contribute to that loss? If so, this looks OK to me.
This shouldn't be an issue for your current use case, but I want to mention that this probably won't work correctly in `find_unused_parameters=True` mode. Because the `mlp` is used outside of `forward`, and DDP finds unused parameters by traversing the autograd graph from the `forward` output, in that mode DDP would treat the parameters in `mlp` as unused even though they are actually part of the autograd graph.
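For illustration, that mode would be switched on at construction time, sth like:

```python
# hypothetical sketch: with find_unused_parameters=True, DDP walks the
# autograd graph from forward()'s outputs to decide which parameters
# were used. self.mlp is only reached via model.module.classify(...),
# outside forward, so its gradients would never be reduced across ranks.
ddp = DistributedDataParallel(
    model.to(device),
    device_ids=[device],
    find_unused_parameters=True,
)
```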
My loss function is something like:
```python
loss1 = adv_loss(out)     # make output image look realistic
loss2 = adv_loss(out2)
loss3 = adv_loss(enc)     # make encoding normally distributed
loss4 = adv_loss(enc2)
loss5 = l1_loss(out, x)   # reconstruction loss
loss6 = l1_loss(out2, x)
loss7 = cross_entropy_loss(pred, GT)
```
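and they all get summed into one loss before a single backward pass, sth like (weighting omitted):

```python
# every output (out, out2, enc, enc2, pred) feeds the same autograd
# graph, so one backward computes gradients for encoder, decoder, and mlp
loss = loss1 + loss2 + loss3 + loss4 + loss5 + loss6 + loss7
optimizer.zero_grad()
loss.backward()
optimizer.step()
```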
I don't have `find_unused_parameters=True` and there is no error. If I understand what you are saying, the gradients are fine?
Yes, I think the gradients should be fine.