High loss when resuming training

Hi all,

I am getting a high training loss when I try to resume training from a saved .pth file.

Please let me know what could be the reason for it.

You might have forgotten to load the state_dict of the optimizer after restoring the training, which could cause the training to diverge. If that’s not the case, you might have changed the preprocessing of the data etc., so that the model sees “new” samples.
Without knowing more details, these would be my guesses.
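For reference, this is the kind of checkpoint I mean; a minimal sketch (model, optimizer, epoch, and checkpoint.pth are placeholder names for your own setup):

import torch

# save model and optimizer state together in one checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pth')

# when resuming, restore both state_dicts before continuing the training
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1

If only the model’s state_dict is restored, a freshly created optimizer loses e.g. the momentum buffers of SGD or the running estimates of Adam, which can make the first resumed updates much larger than before.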

Hi ptrblck,

I am loading the state_dict of the optimizer as well. There is some randomness in the data preparation.

Could that be the reason??

If the randomness is introduced only after resuming the training, it could be the reason.
However, if the input tensors use the same data augmentation / transformation, it shouldn’t increase the loss unexpectedly.
I would recommend checking the model with a static input tensor, e.g. torch.ones, before saving and after loading the model to make sure the output is the same (call model.eval() to disable e.g. dropout layers). If these outputs don’t match (up to floating point precision), the model loading itself seems to be failing.
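To make this concrete, a rough sketch of the check (the input shape (1, 3, 224, 224) is just a placeholder; use whatever shape your model expects):

import torch

# before saving: run a fixed input through the model in eval mode
model.eval()
fixed_input = torch.ones(1, 3, 224, 224)  # placeholder shape
with torch.no_grad():
    out_before = model(fixed_input)
torch.save(model.state_dict(), 'model.pth')

# after loading: repeat the forward pass with the same fixed input
model.load_state_dict(torch.load('model.pth'))
model.eval()
with torch.no_grad():
    out_after = model(fixed_input)

# both outputs should match up to floating point precision
print(torch.allclose(out_before, out_after, atol=1e-6))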

The randomness in the data preparation was already present. I am not using torch.ones in the model. When I call model.eval(), the validation results are reproducible, but the loss is high when I call model.train().

When continuing training, try lowering the starting learning rate. What optimizer are you using?

@J_Johnson I am using SGD:

optimizer = optim.SGD(centerface.parameters(), lr=1e-2, momentum=0.9, weight_decay=0.0005)
exp_lr_scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[30, 90, 140], gamma=0.1)
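One thing I still need to check: if exp_lr_scheduler is recreated on resume without restoring its state, the run starts again at lr=1e-2 instead of the decayed value, which alone could explain a loss spike. A rough sketch of checkpointing the scheduler as well (checkpoint.pth and epoch are placeholders):

# save the scheduler state along with the model and optimizer
torch.save({
    'model_state_dict': centerface.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': exp_lr_scheduler.state_dict(),
    'epoch': epoch,
}, 'checkpoint.pth')

# when resuming, restore all three so the learning rate continues from where it left off
checkpoint = torch.load('checkpoint.pth')
centerface.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
exp_lr_scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
print(optimizer.param_groups[0]['lr'])  # sanity check on the resumed learning rate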