Hi there!
I am using PyTorch DDP for multi-GPU training, and it seemed to work well until I found out that the model parameters differ across GPUs. (They should be identical if the all-reduce operation works properly.)
I am debugging it now but still haven't figured out the cause.
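This is roughly the check I run on each rank after an optimizer step to see the divergence (just a sketch; check_params_synced is my own helper name, and it assumes the default process group is already initialized):

import torch
import torch.distributed as dist

def check_params_synced(model):
    ## Broadcast rank 0's copy of each parameter and compare it to the local copy
    for name, p in model.named_parameters():
        ref = p.detach().clone()
        dist.broadcast(ref, src=0)
        if not torch.allclose(p.detach(), ref):
            print(f"rank {dist.get_rank()}: parameter {name} differs from rank 0")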
My model wrapped with DDP looks like the following (it does not define a "forward" method explicitly):
class MyModel(nn.Module):
    def __init__(self, ...):
        super().__init__()
        ## Create sub-modules which have their own forward functions
        self.sub_model_a = SubModel(...)
        self.sub_model_b = SubModel(...)

    def func_a(self, image):
        ## Do something with sub_model_a
        return self.sub_model_a(image)

    def func_b(self, image):
        ## Do something with sub_model_b
        return self.sub_model_b(image)
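For reference, the only explicit forward I can think of would be a small dispatcher inside MyModel, like the sketch below (this is not what I currently use, just to illustrate my first question; the mode argument is made up):

    def forward(self, image, mode="a"):
        ## Hypothetical dispatcher so that calling the DDP wrapper itself
        ## (instead of the unwrapped module) would still reach func_a / func_b
        if mode == "a":
            return self.func_a(image)
        return self.func_b(image)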
So my questions are:
- Should the target module wrapped by DDP define a "forward" method explicitly?
- If not, how can I check that the all-reduce operation works properly? Right now, I explicitly compare the model parameter values across GPUs (the check shown above).
- Since the model's attributes are wrapped under .module after applying DDP, I call the "func_a" and "func_b" methods of the MyModel class as follows:
mymodel = DDP(MyModel(...).to(rank), device_ids=[rank])
mymodel = mymodel.module

for epoch in range(num_epochs):
    ## Just an example
    loss = mymodel.func_a(image)
    loss.backward()
Would this pattern be problematic with DDP?
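For more context, here is a minimal sketch of my overall setup (names like train, loader, and the SGD hyperparameters are just placeholders for my real code):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size, num_epochs, loader):
    ## Placeholder sketch of my training script
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = MyModel(...).to(rank)
    mymodel = DDP(model, device_ids=[rank]).module   ## unwrapped, as shown above
    optimizer = torch.optim.SGD(mymodel.parameters(), lr=0.01)

    for epoch in range(num_epochs):
        for image in loader:
            optimizer.zero_grad()
            loss = mymodel.func_a(image.to(rank))
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()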
Thank you in advance.