Custom methods in DistributedDataParallel

I am trying to do multi-GPU training with DistributedDataParallel, which I wrap around my model. However, my model has a custom function that I now call with model.module.function(x). Is this OK, or will it cause problems? Thanks

What does this custom function do, and when do you call it? If it does not modify the parameters or the autograd graph built during the forward pass, it should be OK.

The pseudo code is something like this

output = model(input)
output2 = model(input2)
final_output = model.module.function(output, output2)
loss = loss_function(final_output)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Would this be fine? The custom function is just an MLP that classifies something. It does not change anything, but I want its weights to be updated when I call optimizer.step().

If model is a DistributedDataParallel (DDP) instance, this won’t work, because DDP sets up some internal state at the end of the forward pass and does not work if you call forward twice without a backward in between.

However, this can be easily solved by putting the two forward calls and the function invocation into a wrapper model, and then passing that wrapper model to DDP, something like:

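import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

class WrapperModel(nn.Module):
  def __init__(self, model):
    super().__init__()
    self.model = model

  def forward(self, input, input2):
    # Both forward passes and the custom function run inside one DDP forward.
    output = self.model(input)
    output2 = self.model(input2)
    # self.model is the plain, unwrapped module here, so there is no .module.
    final_output = self.model.function(output, output2)
    return final_output

# model, device, loss_function and optimizer are assumed to be defined as before.
ddp = DistributedDataParallel(WrapperModel(model).to(device), device_ids=[device])

# Call the DDP instance itself rather than ddp.forward so that DDP's own logic runs.
final_output = ddp(input, input2)
loss = loss_function(final_output)
optimizer.zero_grad()
loss.backward()
optimizer.step()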

I set broadcast_buffers=False, so I didn't have an issue calling forward twice. In that case, is it fine if I call my custom function the way I did, and will the gradients be correct?

If model.module.function does not use the parameters in the model, it should work.

A few more details on my method. The pseudo code is

class model(nn.Module):
  def __init__(self) :
    super(model, self).__init__()
    self.encoder = Encoder()
    self.decoder = Decoder()
    self.mlp = MLP()
  def encode(self, x):
    return self.encoder(x)
  def decode(self, x): 
    return self.decoder(x)
  def classify(self, a, b):
    return self.mlp(a, b)
  def forward(self, x):
    enc = self.encode(x)
    out = self.decode(enc)
    return enc, out
# this is my main training script
enc, out = model(x)
enc2 = enc + d #d is some random perturbations
out2 = model.module.decode(enc2)
pred = model.module.classify(enc, enc2)

There is a bunch of other stuff, but in this scenario my decode function does use the parameters in the model. Would this be an issue? There are no errors when running.

How do you compute the final loss (the one that backward is launched from)? I assume both enc and out contribute to that loss? If so, this looks OK to me.

This should not be an issue for your current use case, but I want to mention that it probably won’t work correctly in find_unused_parameters=True mode, because the mlp is used outside of forward and DDP finds unused parameters by starting from the forward output. So in that mode, DDP would treat the parameters in mlp as unused even though they are actually part of the autograd graph.
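To make the find_unused_parameters point concrete, here is a minimal sketch in plain PyTorch, without DDP. It only imitates the idea of walking the autograd graph backwards from the forward outputs; DDP's real implementation differs. ToyModel and params_reachable_from are hypothetical names made up for this illustration, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # Hypothetical stand-in for the encoder/decoder/mlp model in this thread.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 3)
        self.decoder = nn.Linear(3, 4)
        self.mlp = nn.Linear(6, 2)

    def classify(self, a, b):
        return self.mlp(torch.cat([a, b], dim=1))

    def forward(self, x):
        enc = self.encoder(x)
        return enc, self.decoder(enc)

def params_reachable_from(outputs, model):
    # Walk the autograd graph from the given outputs and collect the names of
    # the parameters whose gradient-accumulation nodes are reachable.
    names = {id(p): n for n, p in model.named_parameters()}
    seen, found = set(), set()
    stack = [o.grad_fn for o in outputs if o.grad_fn is not None]
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        # Leaf (AccumulateGrad) nodes expose the tensor they accumulate into.
        var = getattr(fn, "variable", None)
        if var is not None and id(var) in names:
            found.add(names[id(var)])
        stack.extend(next_fn for next_fn, _ in fn.next_functions)
    return found

torch.manual_seed(0)
toy = ToyModel()
enc, out = toy(torch.randn(2, 4))

used = params_reachable_from((enc, out), toy)
unused = sorted(set(dict(toy.named_parameters())) - used)
print(unused)  # the mlp parameters are not reachable from the forward outputs
```

Seen only from the forward outputs, the mlp parameters look unused, even though a later classify call would put them into the graph that backward runs over.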

My loss function is something like

loss1 = adv_loss(out) #make output image look realistic
loss2 = adv_loss(out2)
loss3 = adv_loss(enc) #make encoding normal distributed
loss4 = adv_loss(enc2)
loss5 = l1_loss(out, x) # reconstruction loss
loss6 = l1_loss(out2, x)
loss7 = cross_entropy_loss(pred, GT)

I don't have find_unused_parameters=True and I get no errors. If I understand you correctly, the gradients are fine?

Yes, I think the gradients should be fine.
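If you want to sanity-check this yourself, a rough single-process sketch follows. It does not use DDP. It registers a plain tensor hook on every parameter and records which ones receive a gradient when decode and classify are called outside forward. DDP relies on per-parameter autograd hooks to trigger gradient synchronization, so this is only an approximation of what it sees. ToyModel, the layer sizes and the simplified loss are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    # Hypothetical stand-in for the encoder/decoder/mlp model in this thread.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 3)
        self.decoder = nn.Linear(3, 4)
        self.mlp = nn.Linear(6, 2)

    def decode(self, z):
        return self.decoder(z)

    def classify(self, a, b):
        return self.mlp(torch.cat([a, b], dim=1))

    def forward(self, x):
        enc = self.encoder(x)
        return enc, self.decode(enc)

torch.manual_seed(0)
toy = ToyModel()

# Record the name of every parameter whose gradient hook fires during backward.
fired = set()
for name, p in toy.named_parameters():
    p.register_hook(lambda grad, name=name: fired.add(name))

x = torch.randn(2, 4)
target = torch.tensor([0, 1])

enc, out = toy(x)                         # regular forward pass
enc2 = enc + 0.1 * torch.randn_like(enc)  # perturbed encoding
out2 = toy.decode(enc2)                   # called outside forward
pred = toy.classify(enc, enc2)            # called outside forward

# Simplified loss: two reconstruction terms plus the classification term.
loss = F.l1_loss(out, x) + F.l1_loss(out2, x) + F.cross_entropy(pred, target)
loss.backward()

missing = sorted(set(dict(toy.named_parameters())) - fired)
print(missing)  # parameters that never received a gradient
```

An empty list means every parameter, including the mlp and decoder used outside forward, received a gradient in the single backward pass.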
