yairkit (Yairkit) #1
I load my trained model from a checkpoint for fine-tuning.
When I do:
model.eval()
model(x)
the output looks fine; the loss is on the same scale as at the end of pre-training.
But with:
model.train()
model(x)
the output is totally different and very bad, as if I were training from scratch.
- the model was pretrained with DDP
- the model has BN (BatchNorm) layers
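
Roughly what I am doing (a minimal sketch; the model class, the checkpoint path, and the "model" key are placeholders for my actual setup):

import torch

model = MyModel()  # placeholder for the real architecture
state = torch.load("checkpoint.pth", map_location="cpu")  # placeholder path
model.load_state_dict(state["model"])  # assuming weights are stored under "model"

model.eval()
with torch.no_grad():
    out_eval = model(x)  # looks fine, loss scale matches end of pre-train

model.train()
out_train = model(x)  # totally different, as if training from scratch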
Am I doing something wrong?
Thanks
Kushaj (Kushajveer Singh) #2
For model.train(), are you also updating the parameters? If you just want to run inference on your model, you should use:

model.eval()
with torch.no_grad():
    output = model(x)  # inference
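
If you are actually training, a typical step looks like this (a minimal sketch, assuming an optimizer, a criterion, and targets y are already defined):

model.train()
optimizer.zero_grad()
output = model(x)
loss = criterion(output, y)  # compute the training loss
loss.backward()              # backpropagate
optimizer.step()             # update the parameters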
rvarm1 (Rohan Varma) #3
In addition, you can try setting torch.backends.cudnn.enabled = False when training with SyncBatchNorm and DDP, as discussed in Training performance degrades with DistributedDataParallel.
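
A minimal sketch of where those pieces go (assuming the process group is already initialized and local_rank comes from your launcher):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

torch.backends.cudnn.enabled = False  # workaround discussed in the linked thread

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # swap BN layers for SyncBatchNorm
model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])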