Model parameters across GPUs are different with DDP

Hi there!

I am using PyTorch DDP for multi-GPU training, and it seemed to work well until I found out that the model parameters across GPUs are different. (They should be the same if the all-reduce operation works properly.)

I am debugging now but still not getting it.

My model wrapped with DDP looks like the following (it does not define a “forward” function explicitly):

class MyModel(nn.Module):
    def __init__(self, ...):
        super().__init__()
        ## Create other modules which have their own forward functions
        self.sub_model_a = SubModel(...)
        self.sub_model_b = SubModel(...)

    def func_a(self, image):
        ## Do something with sub_model_a
        return self.sub_model_a(image)

    def func_b(self, image):
        ## Do something with sub_model_b
        return self.sub_model_b(image)

So my questions are:

  1. Does the target module wrapped by DDP need to define a “forward” function explicitly?
  2. If not, how can I check that the all-reduce operation works properly? Right now I explicitly compare the model parameter values across GPUs (a sketch of this check is below, after the code snippet).
  3. Since the model attributes are wrapped under .module after DDP, I call the “func_a” and “func_b” methods of the MyModel class as follows:
mymodel = DDP(MyModel(...).to(rank), device_ids=[rank])
mymodel = mymodel.module  ## bypasses the DDP wrapper

for epoch in range(num_epochs):
    ## Just an example
    loss = mymodel.func_a(image)
    loss.backward()

Would this pattern be problematic with DDP?
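For reference, this is roughly how I compare the parameters across ranks at the moment (just a sketch; check_params_in_sync is my own helper name, not a PyTorch API):

import torch
import torch.distributed as dist

def check_params_in_sync(model):
    ## Broadcast rank 0's copy of each parameter and compare it
    ## with the local copy on every other rank.
    for name, param in model.named_parameters():
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)
        if not torch.allclose(param.detach(), reference):
            print(f"rank {dist.get_rank()}: parameter {name} is out of sync")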

Thank you in advance.

DDP broadcasts the state_dict during its construction as described in the internal design.
If I understand your use case correctly, you are bypassing the DDP model by manually calling into the internal .module? If so, what’s your use case that would need this approach?
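In case it helps, the usual pattern is to keep training through the wrapper itself, so that every forward call goes through DDP and the gradients are all-reduced during backward. A rough sketch (criterion, optimizer, image, target, rank, and num_epochs are placeholders):

from torch.nn.parallel import DistributedDataParallel as DDP

model = MyModel(...).to(rank)
ddp_model = DDP(model, device_ids=[rank])

for epoch in range(num_epochs):
    output = ddp_model(image)        ## call the wrapper, not ddp_model.module
    loss = criterion(output, target)
    loss.backward()                  ## DDP all-reduces gradients here
    optimizer.step()
    optimizer.zero_grad()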

Thanks for the answer.

Yes, you are right, I was indeed bypassing the DDP model by manually calling “.module” on the DDP-wrapped model. I think that’s why the model parameters across GPUs were different.

Now I have created an explicit forward function in the MyModel class and use it instead of “func_a” and “func_b”, and the model parameters across GPUs are the same. I think the DDP-wrapped model triggers synchronization through its forward pass.
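In case it is useful to others, a minimal sketch of what such a dispatching forward could look like (the mode argument is just my own naming, not anything DDP requires):

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, sub_model_a, sub_model_b):
        super().__init__()
        self.sub_model_a = sub_model_a
        self.sub_model_b = sub_model_b

    def forward(self, image, mode="a"):
        ## Dispatch inside forward so the call goes through the DDP wrapper
        if mode == "a":
            return self.sub_model_a(image)
        return self.sub_model_b(image)

## Used through the wrapper:
## out = ddp_model(image, mode="a")

Note that if only one branch runs in a given iteration, the other branch’s parameters receive no gradients, so DDP may need find_unused_parameters=True when it is constructed.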