Different results when using Caffe and PyTorch

I noticed this difference too. My training works fine in Caffe, but in PyTorch, if I change the learning rate at the same stages/iterations where Caffe changes it (step-wise), I suddenly get NaN loss values.
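For context, my understanding of the two update rules (paraphrasing the note in the torch.optim.SGD docs; mu is the momentum, lr the learning rate, g the gradient):

# Caffe (Sutskever et al. style):
v = mu * v - lr * g
p = p + v

# PyTorch torch.optim.SGD:
v = mu * v + g
p = p - lr * v

These are equivalent while lr stays constant, but at the step where lr changes, PyTorch's next update effectively rescales the whole accumulated velocity by new_lr / old_lr, whereas Caffe only applies the new lr to the fresh gradient term.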
Looking at the difference here, I thought I could change adjust_learning_rate to also change the momentum, like:

def adjust_learning_rate(optimizer, lr, momentum):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
        # try to compensate for the changed lr by rescaling the momentum
        param_group['momentum'] = momentum / lr

But still, as soon as the lr is changed, the loss becomes NaN.
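In hindsight, momentum / lr is greater than 1 for any lr below the momentum value (e.g. 0.9 / 0.001 = 900), and a momentum coefficient above 1 makes the velocity buffer grow geometrically, which on its own would produce NaNs. An alternative I am considering (a minimal sketch, assuming plain torch.optim.SGD, which keeps its velocity under optimizer.state[p]['momentum_buffer']; the helper name is just for illustration) is to leave momentum untouched and instead rescale the stored buffers at the moment the lr changes:

def adjust_learning_rate_rescaled(optimizer, new_lr):
    # PyTorch applies lr at step time (p -= lr * v), so multiplying the
    # buffer by old_lr / new_lr keeps the effective velocity lr * v
    # continuous across the schedule step, mimicking Caffe's behaviour.
    for param_group in optimizer.param_groups:
        old_lr = param_group['lr']
        param_group['lr'] = new_lr
        for p in param_group['params']:
            state = optimizer.state.get(p, {})
            if 'momentum_buffer' in state:
                state['momentum_buffer'].mul_(old_lr / new_lr)

# usage: adjust_learning_rate_rescaled(optimizer, 0.001)

I haven't verified this end to end, though.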

@Nick_Young, how did you solve the SGD discrepancy with Caffe?