Continuing training from a checkpoint returns high loss values, while loss is reasonable with .eval()

In addition, you can try setting torch.backends.cudnn.enabled = False when training with SyncBatchNorm and DDP, as discussed in Training performance degrades with DistributedDataParallel.
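The snippet below is a minimal sketch of where that flag fits in a SyncBatchNorm + DDP setup, assuming a single-node run launched with torchrun; the model, optimizer, and data are placeholders, and only the cudnn line is the workaround itself.

```python
# Minimal sketch (single-node DDP launched with torchrun); the model and
# hyperparameters are hypothetical placeholders for illustration only.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Workaround discussed above: disable cuDNN for this run.
    torch.backends.cudnn.enabled = False

    # Typical DDP initialization; rank comes from the launcher's env vars.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Placeholder model containing BatchNorm layers.
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        nn.BatchNorm2d(16),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, 10),
    ).cuda(rank)

    # Convert BatchNorm layers to SyncBatchNorm before wrapping in DDP.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[rank])

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Dummy training step, just to show the setup end to end.
    inputs = torch.randn(8, 3, 32, 32, device=rank)
    targets = torch.randint(0, 10, (8,), device=rank)
    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A run of this sketch would look like torchrun --nproc_per_node=2 train.py (file name assumed). Note that disabling cuDNN trades some speed for avoiding the cuDNN code paths, so it is a diagnostic workaround rather than a permanent setting.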