I noticed this difference too. My training works fine in Caffe, but in PyTorch, if I change the learning rate at the same stages/iterations where Caffe changes it (step-wise), I suddenly get NaN loss values.
Looking at the difference here, I thought I could change adjust_learning_rate to also change the momentum, like this:
```python
def adjust_learning_rate(optimizer, lr, momentum):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
        # rescale the momentum coefficient to try to compensate for the difference
        param_group['momentum'] = momentum / lr
```
But still, as soon as the lr changes, the loss becomes NaN.
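
For what it's worth, the root cause seems to be where the learning rate enters the update. Caffe computes v = momentum * v + lr * grad and then does w -= v, while PyTorch's SGD computes buf = momentum * buf + grad and does w -= lr * buf, so PyTorch's buffer corresponds to Caffe's velocity divided by lr. Dividing the momentum coefficient by lr (as above) makes it larger than 1 for any lr < 1, which lets the velocity grow without bound and would explain the NaNs. A workaround that should behave like Caffe is to leave momentum alone and rescale the momentum buffers themselves when the lr changes. Here is a minimal, untested sketch; the function name, the new_lr argument, and reading optimizer.state directly are my own choices, not from the original post:

```python
import torch

def adjust_learning_rate(optimizer, new_lr):
    """Change the lr and rescale the momentum buffers so the effective
    velocity (lr * buf) is preserved, mimicking Caffe's step-wise schedule."""
    for param_group in optimizer.param_groups:
        old_lr = param_group['lr']
        param_group['lr'] = new_lr
        for p in param_group['params']:
            state = optimizer.state[p]
            if 'momentum_buffer' in state:
                # PyTorch applies w -= lr * buf, so scaling buf by
                # old_lr / new_lr keeps lr * buf (Caffe's velocity) unchanged.
                state['momentum_buffer'].mul_(old_lr / new_lr)
```

With this, the contribution of the accumulated history changes smoothly across the lr boundary instead of being rescaled by the new lr, and the momentum coefficient itself stays below 1.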
@Nick_Young, how did you solve the SGD discrepancy with Caffe?