Continuing training from a checkpoint returns high loss values, while the loss is reasonable with .eval()

I load my trained model from a checkpoint in order to continue with fine-tuning.

Then when I do:
model.eval()
model(x)
the output seems OK: the loss is on the same scale as at the end of pre-training.

But for:
model.train()
model(x)
the output is totally different and very bad, just as if I were training from scratch.

  • the model was pre-trained with DDP
  • the model has BN layers
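
Roughly what I'm doing, as a minimal sketch (the tiny model, the checkpoint path, and the random data below are made-up placeholders, not my actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model with a BN layer, just to illustrate the eval()/train() gap
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
state = torch.load("checkpoint.pth", map_location="cpu")  # hypothetical checkpoint path
model.load_state_dict(state)

x = torch.randn(8, 3, 32, 32)   # dummy batch
y = torch.randint(0, 10, (8,))

model.eval()                    # BN uses the running stats loaded from the checkpoint
with torch.no_grad():
    print("eval loss:", F.cross_entropy(model(x), y).item())   # on my real model: same scale as pre-training

model.train()                   # BN uses the current batch statistics instead
print("train loss:", F.cross_entropy(model(x), y).item())      # on my real model: very high, as if from scratch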

Am I doing something wrong?
Thanks

For model.train(), are you also updating the parameters? If you just want to run inference on your model, you should use:

model.eval()            # put BN (and dropout) layers into inference mode
with torch.no_grad():   # no gradients needed for inference
    output = model(x)

In addition, you can try setting torch.backends.cudnn.enabled = False when training with SyncBatchNorm and DDP, as discussed in Training performance degrades with DistributedDataParallel.
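
A rough sketch of that setup, in case it helps (the toy model, the process-group initialization, and the torchrun-style LOCAL_RANK handling below are assumptions, not taken from your code):

import os

import torch
import torch.distributed as dist
import torch.nn as nn

torch.backends.cudnn.enabled = False          # work around the cuDNN batch-norm path

dist.init_process_group(backend="nccl")       # assumes a torchrun-style launch
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())  # placeholder model
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)    # replace BN layers with SyncBatchNorm
model = nn.parallel.DistributedDataParallel(model.cuda(local_rank),
                                            device_ids=[local_rank])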