I’ve noticed something peculiar in DDP training, but cannot seem to find any related posts.
I have an ordinary ConvNet whose `forward()` is defined in the following manner:
```python
from torch.nn import Module

class MyModel(Module):
    ...  # __init__ etc. omitted

    def forward(self, x):
        mid = self.feature_extractor(x)
        return self.classifier(mid)

    def feature_extractor(self, x):
        ...  # some layers here

    def classifier(self, x):
        ...  # some layers here
```
In some cases I want the mid-level features; in others, I don't need them at all. Also, sometimes I already have mid-level features from elsewhere and just want a forward pass through the classifier. Therefore, instead of adding a conditional to my `forward` method, I've split `forward` into two sub-methods, so I can call `feature_extractor` and `classifier` as needed. For reference, the conditional version I'm avoiding would look roughly like the sketch below.
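Just to illustrate the alternative; the `mode` flag here is purely hypothetical, not something from my actual code:

```python
# Hypothetical flag-based forward(): every call would still go
# through the DDP wrapper instead of through model.module.
def forward(self, x, mode="full"):
    if mode == "features":
        return self.feature_extractor(x)
    if mode == "classify":
        return self.classifier(x)  # x is assumed to be mid-level features
    return self.classifier(self.feature_extractor(x))
```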
It turns out that in DDP (maybe in DP as well?) there is a difference between training with:

1. `out = model(x)`, and
2. `mid = model.module.feature_extractor(x); out = model.module.classifier(mid)`
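For context, my training step looks roughly like the following minimal sketch (launched via `torchrun`; the setup names and dummy batch are illustrative, not my exact code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
model = DDP(MyModel().to(local_rank), device_ids=[local_rank])
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Dummy batch just to illustrate the two call patterns.
x = torch.randn(8, 3, 32, 32, device=local_rank)
y = torch.randint(0, 10, (8,), device=local_rank)

optimizer.zero_grad()
out = model(x)                                # pattern (1): trains fine
# mid = model.module.feature_extractor(x)    # pattern (2): loss degrades
# out = model.module.classifier(mid)
loss = criterion(out, y)
loss.backward()
optimizer.step()
```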
In case (1), the model trains exactly as expected. In case (2), however, the loss converges more slowly and may not even reach the same final value as in (1).
My question is: what is the underlying mechanism that makes (1) and (2) behave differently? Could it be that (2) does not perform gradient synchronization across processes? Or perhaps `nn.SyncBatchNorm` is not working as expected? One way I thought of testing the gradient-sync hypothesis is sketched below.
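A rough diagnostic, assuming that after DDP's allreduce every rank should hold the same averaged gradients:

```python
# Run right after loss.backward(). If DDP synchronized the gradients,
# all ranks should print the same value; divergent values would point
# to missing synchronization in pattern (2).
g = next(model.parameters()).grad
print(f"rank {dist.get_rank()}: grad sum = {g.sum().item():.6f}")
```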