I am training the same LSTM network architecture in Caffe and PyTorch, but they give very different results.
The Caffe model's accuracy is about 98%, while the PyTorch version reaches only about 50%. Why?
The optimizers might be subtly different; one's learning rate or momentum scaling may differ slightly from the other's…
Hi, I've met a similar problem: the PyTorch result is worse than Caffe's. Have you solved it? Thank you!
I finally got a result close to the Caffe version after I clarified some differences between Caffe and PyTorch:
- The SGD implementation (the momentum update differs between the two frameworks).
- Dropout is not applied if there is only one RNN layer.
- The data preprocessing in my PyTorch version was slightly different from the Caffe version.
After I fixed these problems, I got comparable results.
Hope this helps!
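On the dropout point: in PyTorch, the `dropout` argument of `nn.LSTM` is only applied between stacked layers, so with `num_layers=1` it has no effect (PyTorch emits a warning). A minimal sketch of the workaround, assuming you want dropout on the outputs of a single-layer LSTM (the layer sizes here are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# With num_layers=1, passing dropout= to nn.LSTM would be a no-op,
# so apply an explicit nn.Dropout to the outputs instead.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1)
post_dropout = nn.Dropout(p=0.5)

x = torch.randn(5, 3, 8)   # (seq_len, batch, input_size)
out, _ = lstm(x)           # out: (seq_len, batch, hidden_size)
out = post_dropout(out)    # dropout applied manually after the layer
```

This mirrors what a Caffe model with an explicit Dropout layer after the LSTM would do.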
Thanks for your reply! Could you please describe the first factor in more detail? Which difference matters?
You could refer to here: torch.optim — PyTorch master documentation
In the note under SGD:
"The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks."
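To make the note concrete, here is a sketch of the two momentum conventions as usually described (these update rules are my paraphrase of the documented difference, not code from either framework): PyTorch accumulates raw gradients in the velocity and applies the learning rate at the end, while Caffe folds the learning rate into the velocity itself.

```python
def pytorch_style_step(p, g, v, lr, mu):
    # PyTorch convention: velocity accumulates gradients,
    # lr scales the whole velocity at update time.
    v = mu * v + g
    p = p - lr * v
    return p, v

def caffe_style_step(p, g, v, lr, mu):
    # Caffe/Sutskever-style convention: lr is baked into the velocity.
    v = mu * v + lr * g
    p = p - v
    return p, v
```

With a constant learning rate the two are equivalent (the Caffe velocity is just `lr` times the PyTorch one), but the stored velocities differ by a factor of `lr`, so the moment you change the learning rate mid-training the two schemes diverge.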
Ok, I’ll check it. Thank you very much!
I noticed this difference too. My training works fine in Caffe, but in PyTorch, if I change the learning rate at the same stages/iterations where Caffe changes it (step-wise), I suddenly get NaN loss values.
Looking at the difference, I thought I could change `adjust_learning_rate` to also rescale the momentum, like:

```python
def adjust_learning_rate(optimizer, lr, momentum):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
        param_group['momentum'] = momentum / lr
```

But still, as soon as `lr` is changed, the loss becomes NaN.
@Nick_Young, how did you solve the SGD discrepancy with Caffe?
As mentioned above.