Continuing training from a checkpoint returns high loss values, while the loss is reasonable with .eval()

I load my trained model from a checkpoint in order to continue with fine-tuning.

Then when I do:
model.eval()
model(x)
the output seems OK: the loss is on the same scale as at the end of pre-training.

But for:
model.train()
model(x)
the output is totally different and very bad, just as if I were training from scratch.

  • the model was pre-trained with DDP
  • the model has BN layers
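
Roughly what I'm doing, as a minimal sketch (the tiny model, the checkpoint path, and the random data below are made-up placeholders, not my actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model with a BN layer, just to illustrate the eval()/train() gap
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
state = torch.load("checkpoint.pth", map_location="cpu")  # hypothetical checkpoint path
model.load_state_dict(state)

x = torch.randn(8, 3, 32, 32)   # dummy batch
y = torch.randint(0, 10, (8,))

model.eval()                    # BN uses the running stats loaded from the checkpoint
with torch.no_grad():
    print("eval loss:", F.cross_entropy(model(x), y).item())   # on my real model: same scale as pre-training

model.train()                   # BN uses the current batch statistics instead
print("train loss:", F.cross_entropy(model(x), y).item())      # on my real model: very high, as if from scratch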

Am I doing something wrong?
Thanks

For model.train(), are you also updating the parameters? If you just want to run inference on your model, you should use:

model.eval()            # put BN (and dropout) layers into inference mode
with torch.no_grad():   # no gradients needed for inference
    output = model(x)

In addition, you can try setting torch.backends.cudnn.enabled = False when training with SyncBatchNorm and DDP, as discussed in Training performance degrades with DistributedDataParallel.
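
A rough sketch of that setup, in case it helps (the toy model, the process-group initialization, and the torchrun-style LOCAL_RANK handling below are assumptions, not taken from your code):

import os

import torch
import torch.distributed as dist
import torch.nn as nn

torch.backends.cudnn.enabled = False          # work around the cuDNN batch-norm path

dist.init_process_group(backend="nccl")       # assumes a torchrun-style launch
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())  # placeholder model
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)    # replace BN layers with SyncBatchNorm
model = nn.parallel.DistributedDataParallel(model.cuda(local_rank),
                                            device_ids=[local_rank])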