Different results when using Caffe and PyTorch


(Nick Young) #1

I am training the same LSTM network architecture with Caffe and PyTorch, but they give very different results.
The Caffe model's accuracy is about 98%, while the PyTorch version only reaches 50%. Why?


#2

The optimizers might be subtly different: the learning rate or momentum scaling in one may differ a bit from the other…


#3

Hi, I have a similar problem: my PyTorch results are worse than Caffe's. Have you solved it? Thank you!


(Nick Young) #4

I finally got a result close to the Caffe version after I clarified some differences between Caffe and PyTorch:

  1. The SGD implementation is different.
  2. PyTorch does not apply the RNN dropout argument if there is only one RNN layer.

I also found that the data preprocessing in my PyTorch version was slightly different from the Caffe version.

After I solved these problems, I got a comparable result.

Hope this helps!
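For anyone who wants to check the dropout point themselves, here is a minimal sketch. The sizes, the seed and the helper name outputs_vary_between_passes are made up for illustration. It runs the same input twice through an nn.LSTM in training mode: if dropout is active the two outputs should differ, and if it is not they should be identical.

```python
import warnings

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(5, 3, 8)  # (seq_len, batch, input_size), arbitrary toy sizes


def outputs_vary_between_passes(num_layers):
    """Return True if two training-mode passes over the same input differ."""
    with warnings.catch_warnings():
        # PyTorch warns when dropout > 0 is combined with num_layers=1.
        warnings.simplefilter("ignore")
        lstm = nn.LSTM(input_size=8, hidden_size=16,
                       num_layers=num_layers, dropout=0.5)
    lstm.train()  # dropout is only active in training mode
    with torch.no_grad():
        out_a, _ = lstm(x)
        out_b, _ = lstm(x)
    return not torch.equal(out_a, out_b)


single_layer_varies = outputs_vary_between_passes(1)
two_layer_varies = outputs_vary_between_passes(2)
print("1 layer, outputs vary:", single_layer_varies)
print("2 layers, outputs vary:", two_layer_varies)
```

If the single-layer case shows no variation, one option is to apply a separate nn.Dropout module to the LSTM output. Whether that matches what a given Caffe network does depends on where the dropout layer sits in that network.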


#5

Thanks for your reply! Could you please describe the first factor in detail? What difference matters?


(Nick Young) #6

You can refer to the docs here: http://pytorch.org/docs/master/optim.html

In the note on SGD:

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.
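To make the note concrete, here is a toy sketch in plain Python (no framework needed). It compares the two momentum update rules on a single scalar weight. The function names, the constant gradient of 1.0 and the learning-rate values are made up for illustration. The Sutskever/Caffe-style rule folds the learning rate into the velocity, while the PyTorch-style rule applies it when the velocity is subtracted from the weight.

```python
def caffe_style_sgd(grads, lrs, momentum, w0=0.0):
    """Sutskever/Caffe-style: v = momentum * v + lr * g ; w = w - v."""
    w, v = w0, 0.0
    history = []
    for g, lr in zip(grads, lrs):
        v = momentum * v + lr * g
        w = w - v
        history.append(w)
    return history


def pytorch_style_sgd(grads, lrs, momentum, w0=0.0):
    """PyTorch-style: v = momentum * v + g ; w = w - lr * v."""
    w, v = w0, 0.0
    history = []
    for g, lr in zip(grads, lrs):
        v = momentum * v + g
        w = w - lr * v
        history.append(w)
    return history


grads = [1.0, 1.0, 1.0, 1.0]         # toy gradients, fixed for simplicity
constant_lr = [0.1, 0.1, 0.1, 0.1]
stepped_lr = [0.1, 0.1, 0.01, 0.01]  # step-wise decay after two iterations

caffe_const = caffe_style_sgd(grads, constant_lr, momentum=0.9)
torch_const = pytorch_style_sgd(grads, constant_lr, momentum=0.9)
caffe_step = caffe_style_sgd(grads, stepped_lr, momentum=0.9)
torch_step = pytorch_style_sgd(grads, stepped_lr, momentum=0.9)

print("constant lr:", caffe_const, torch_const)
print("stepped lr: ", caffe_step, torch_step)
```

With a constant learning rate the two rules give the same weights up to floating-point rounding. Once the learning rate changes they diverge: the Caffe-style velocity still carries the old learning rate, while the PyTorch-style rule rescales the whole accumulated velocity by the new one.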


#7

Ok, I’ll check it. Thank you very much!


(dashesy) #8

I noticed this difference too. My training works fine in Caffe, but in PyTorch, if I change the learning rate at the same stages/iterations where Caffe changes it (step-wise), I suddenly get nan loss values.
Looking at the difference here, I thought I could change adjust_learning_rate to also change the momentum, like this:

def adjust_learning_rate(optimizer, lr, momentum):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
        param_group['momentum'] = momentum / lr

But still, as soon as lr is changed, the loss becomes nan.

@Nick_Young how did you solve the SGD discrepancy with Caffe?
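One thing to note about the momentum / lr adjustment above: with, say, momentum 0.9 and lr 0.01, it sets the momentum coefficient to 90. The velocity then grows by roughly that factor every step, which would be consistent with the nan loss.

A possible alternative, sketched below, is to leave momentum alone and rescale the stored momentum buffers by old_lr / new_lr whenever the learning rate changes, so that lr * buffer stays what the Caffe-style rule would carry over. The helper name adjust_learning_rate_keep_step and the toy numbers are made up for illustration, and the sketch ignores Nesterov and dampening. It has only been checked against the toy reference loop in the block, not against Caffe itself.

```python
import torch


def adjust_learning_rate_keep_step(optimizer, new_lr):
    """Set a new lr and rescale momentum buffers so lr * buffer is unchanged."""
    for group in optimizer.param_groups:
        old_lr = group['lr']
        for p in group['params']:
            # SGD keeps its velocity under the 'momentum_buffer' state key.
            buf = optimizer.state.get(p, {}).get('momentum_buffer')
            if buf is not None:
                buf.mul_(old_lr / new_lr)
        group['lr'] = new_lr


# Toy check: constant gradient of 1.0, lr drops from 0.1 to 0.01 at step 2.
w = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1, momentum=0.9)
caffe_w, caffe_v = 0.0, 0.0  # reference: v = momentum * v + lr * g ; w = w - v
for step in range(4):
    lr = 0.1 if step < 2 else 0.01
    if step == 2:
        adjust_learning_rate_keep_step(optimizer, lr)
    w.grad = torch.ones(1)
    optimizer.step()
    caffe_v = 0.9 * caffe_v + lr * 1.0
    caffe_w = caffe_w - caffe_v

print(w.item(), caffe_w)
```

In this toy run the PyTorch weight and the Caffe-style reference should agree up to float32 precision.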


(Nick Young) #9

As mentioned above. :wink:


(dashesy) #10

I ended up with this implementation of Caffe SGD. I would appreciate it if you could take a look.