I’ve noticed something peculiar in DDP training, but cannot seem to find any related posts.
I have an ordinary ConvNet whose `forward()` is defined in the following manner:
```python
from torch.nn import Module

class MyModel(Module):
    ...  # __init__ etc. omitted

    def forward(self, x):
        mid = self.feature_extractor(x)
        return self.classifier(mid)

    def feature_extractor(self, x):
        ...  # some layers here

    def classifier(self, x):
        ...  # some layers here
```
In some cases I want the mid-level features; in others, I don't need them at all. Also, sometimes I already have mid-level features from elsewhere and just want a forward pass through the classifier. Therefore, instead of adding a conditional to my `forward` method, I've split `forward` into two sub-methods, so I can call `feature_extractor` and `classifier` as needed. For reference, the conditional version I'm avoiding would look roughly like the sketch below.
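Just to illustrate the alternative; the `mode` flag here is purely hypothetical, not something from my actual code:

```python
# Hypothetical flag-based forward(): every call would still go
# through the DDP wrapper instead of through model.module.
def forward(self, x, mode="full"):
    if mode == "features":
        return self.feature_extractor(x)
    if mode == "classify":
        return self.classifier(x)  # x is assumed to be mid-level features
    return self.classifier(self.feature_extractor(x))
```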
It turns out that in DDP (maybe in DP as well?) there is a difference between training with:

1. `out = model(x)`, and
2. `mid = model.module.feature_extractor(x); out = model.module.classifier(mid)`
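For context, my training step looks roughly like the following minimal sketch (launched via `torchrun`; the setup names and dummy batch are illustrative, not my exact code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
model = DDP(MyModel().to(local_rank), device_ids=[local_rank])
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Dummy batch just to illustrate the two call patterns.
x = torch.randn(8, 3, 32, 32, device=local_rank)
y = torch.randint(0, 10, (8,), device=local_rank)

optimizer.zero_grad()
out = model(x)                                # pattern (1): trains fine
# mid = model.module.feature_extractor(x)    # pattern (2): loss degrades
# out = model.module.classifier(mid)
loss = criterion(out, y)
loss.backward()
optimizer.step()
```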
In case (1), the model trains exactly as expected. In case (2), however, the loss converges more slowly and may not even reach the same final value as in (1).
My question is: what is the underlying mechanism that makes (1) and (2) behave differently? Could it be that (2) does not perform gradient synchronization across processes? Or perhaps `nn.SyncBatchNorm` is not working as expected? One way I thought of testing the gradient-sync hypothesis is sketched below.
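A rough diagnostic, assuming that after DDP's allreduce every rank should hold the same averaged gradients:

```python
# Run right after loss.backward(). If DDP synchronized the gradients,
# all ranks should print the same value; divergent values would point
# to missing synchronization in pattern (2).
g = next(model.parameters()).grad
print(f"rank {dist.get_rank()}: grad sum = {g.sum().item():.6f}")
```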