Model parameters across GPUs are different with DDP

Hi there!

I am using PyTorch DDP for multi-GPU training, and it seemed to work well until I found out that the model parameters across GPUs are different. (They should be the same if the all-reduce operation works properly.)

I am debugging now but still not getting it.

My model wrapped with DDP looks like the following (it does not define a “forward” function explicitly):

class MyModel(nn.Module):
    def __init__(self, ...):
        super().__init__()
        ## Create other modules which have their own forward functions
        self.sub_model_a = SubModel(...)
        self.sub_model_b = SubModel(...)

    def func_a(self, image):
        ## Do something with sub_model_a
        return self.sub_model_a(image)

    def func_b(self, image):
        ## Do something with sub_model_b
        return self.sub_model_b(image)

So my questions are:

  1. Does the target module wrapped by DDP need to define a “forward” function explicitly?
  2. If not, how can I check that the all-reduce operation works properly? Right now I explicitly compare the model parameter values across GPUs (a sketch of this check is below, after the code snippet).
  3. Since the model attributes are wrapped under .module after DDP, I call the “func_a” and “func_b” methods of the MyModel class as follows:
mymodel = DDP(MyModel(...).to(rank), device_ids=[rank])
mymodel = mymodel.module  ## bypasses the DDP wrapper

for epoch in range(num_epochs):
    ## Just an example
    loss = mymodel.func_a(image)
    loss.backward()

Would this pattern be problematic with DDP?
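For reference, this is roughly how I compare the parameters across ranks at the moment (just a sketch; check_params_in_sync is my own helper name, not a PyTorch API):

import torch
import torch.distributed as dist

def check_params_in_sync(model):
    ## Broadcast rank 0's copy of each parameter and compare it
    ## with the local copy on every other rank.
    for name, param in model.named_parameters():
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)
        if not torch.allclose(param.detach(), reference):
            print(f"rank {dist.get_rank()}: parameter {name} is out of sync")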

Thank you in advance.

DDP broadcasts the state_dict during its construction as described in the internal design.
If I understand your use case correctly, you are bypassing the DDP model by manually calling into the internal .module? If so, what’s your use case that would need this approach?
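In case it helps, the usual pattern is to keep training through the wrapper itself, so that every forward call goes through DDP and the gradients are all-reduced during backward. A rough sketch (criterion, optimizer, image, target, rank, and num_epochs are placeholders):

from torch.nn.parallel import DistributedDataParallel as DDP

model = MyModel(...).to(rank)
ddp_model = DDP(model, device_ids=[rank])

for epoch in range(num_epochs):
    output = ddp_model(image)        ## call the wrapper, not ddp_model.module
    loss = criterion(output, target)
    loss.backward()                  ## DDP all-reduces gradients here
    optimizer.step()
    optimizer.zero_grad()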

Thanks for the answer.

Yes, you are right, I was indeed bypassing the DDP model by manually calling “.module” on the DDP-wrapped model. I think that’s why the model parameters across GPUs were different.

Now I have created an explicit forward function in the MyModel class and use it instead of “func_a” and “func_b”, and the model parameters across GPUs are the same. I think the DDP-wrapped model triggers synchronization through its forward pass.
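In case it is useful to others, a minimal sketch of what such a dispatching forward could look like (the mode argument is just my own naming, not anything DDP requires):

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, sub_model_a, sub_model_b):
        super().__init__()
        self.sub_model_a = sub_model_a
        self.sub_model_b = sub_model_b

    def forward(self, image, mode="a"):
        ## Dispatch inside forward so the call goes through the DDP wrapper
        if mode == "a":
            return self.sub_model_a(image)
        return self.sub_model_b(image)

## Used through the wrapper:
## out = ddp_model(image, mode="a")

Note that if only one branch runs in a given iteration, the other branch’s parameters receive no gradients, so DDP may need find_unused_parameters=True when it is constructed.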