Subtle differences of RMSprop between PyTorch and TensorFlow

I think this is a follow-up discussion of https://github.com/pytorch/pytorch/issues/23796.

It’s a bit of a cliché, but while trying to reproduce a TensorFlow repo in PyTorch, RMSprop once again failed to align. Apart from the known eps inside/outside-the-sqrt problem, I spotted two more differences, which I think should be open to discussion. (I could be wrong, since I know little about all these optimization tricks.)
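(For reference, the eps difference from that issue, in a minimal sketch: TF adds eps inside the square root, PyTorch outside. square_avg here is just a stand-in for the running average of squared gradients.)

import torch

square_avg, eps = torch.tensor([1e-12]), 1e-8
denom_tf = (square_avg + eps).sqrt()  # TensorFlow: sqrt(ms + eps)
denom_pt = square_avg.sqrt() + eps    # PyTorch: sqrt(ms) + eps
print(denom_tf.item(), denom_pt.item())  # diverge when square_avg is tiny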

  1. The momentum buffer is not multiplied by an extra lr in TensorFlow’s implementation, unlike in PyTorch’s (see the side-by-side sketch after this list). To align, I think PyTorch’s RMSprop step should be
# TF-style: fold lr into the momentum buffer (avg = square_avg.sqrt() + eps)
buf.mul_(group['momentum']).addcmul_(grad, group['lr'] / avg)  # buf = momentum*buf + lr*grad/avg
p.data.add_(-buf)                                              # p -= buf, no extra lr here
  2. Weight decay should be multiplied by 2 when moving from TensorFlow to PyTorch (see the runnable check after this list). As far as I understand, in TensorFlow/Keras the L2 loss is l2 * sum(w**2), without the usual 1/2 factor, so its derivative carries a factor of 2. That factor is missing from PyTorch’s implementation, which adds weight_decay * w to the gradient.
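For context, PyTorch’s current update (paraphrasing torch/optim/rmsprop.py at the time of writing) applies lr to the whole buffer at the parameter update instead:

buf.mul_(group['momentum']).addcdiv_(grad, avg)  # buf = momentum*buf + grad/avg
p.data.add_(-group['lr'], buf)                   # p -= lr * buf

And a minimal runnable check of the weight-decay point, showing that the gradient of a Keras-style L2 term matches PyTorch’s weight_decay * w only when weight_decay = 2 * l2 (the optimizer line is just an illustration):

import torch

l2 = 1e-4
w = torch.randn(3, requires_grad=True)

reg_loss = l2 * w.pow(2).sum()   # Keras/TF style: l2 * sum(w**2), no 1/2
reg_loss.backward()
assert torch.allclose(w.grad, 2 * l2 * w.detach())  # derivative carries the 2

opt = torch.optim.RMSprop([w], lr=0.01, weight_decay=2 * l2)  # doubled to match TF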

Correct me if I’m wrong. I’m also interested in why these differences exist if they are by design.

I guess the idea behind the update rule for buf here (RMSprop) is to stay consistent with the rule for SGD with momentum:

buf.mul_(momentum).add_(d_p, alpha=1 - dampening)  # buf = momentum*buf + (1 - dampening)*d_p

SGD Momentum implementation

In the PyTorch implementation the momentum update is slightly changed; refer to the SGD docs.

Plain SGD with momentum and PyTorch’s SGD produce the same iterates when the learning rate is fixed; a quick sketch below demonstrates this.
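(Plain Python; classical_step and pytorch_step are just illustrative names, with dampening = 0.)

def classical_step(p, v, grad, lr, momentum):
    v = momentum * v + lr * grad   # lr baked into the velocity
    return p - v, v

def pytorch_step(p, buf, grad, lr, momentum):
    buf = momentum * buf + grad    # buffer holds raw gradients
    return p - lr * buf, buf       # lr applied only at the end

p1 = p2 = 1.0
v = buf = 0.0
for _ in range(10):
    g = 2.0 * p1                   # gradient of p**2; iterates stay equal, so shared
    p1, v = classical_step(p1, v, g, lr=0.1, momentum=0.9)
    p2, buf = pytorch_step(p2, buf, g, lr=0.1, momentum=0.9)
    assert abs(p1 - p2) < 1e-12    # identical while lr is fixed

The invariant is v = lr * buf, which holds only as long as lr never changes; with a schedule the two formulations diverge.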

Hope this will be helpful.

On #1 - both equations end up producing the same result, assuming a fixed learning rate, if momentum_buffer = 0 at t = 0 (which is what PyTorch uses as initialization; I’m not sure about TF).
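A quick numerical check of that (a sketch, not either library’s real kernel; the same denominator is used for both so the eps-placement issue doesn’t interfere):

import math

lr, momentum, alpha, eps = 0.01, 0.9, 0.99, 1e-8
p_tf = p_pt = 1.0
ms = 0.0                 # shared running average of squared gradients
buf_tf = buf_pt = 0.0    # both momentum buffers start at 0

for _ in range(20):
    g = 2.0 * p_tf                               # same gradient; iterates stay equal
    ms = alpha * ms + (1 - alpha) * g * g
    denom = math.sqrt(ms) + eps
    buf_tf = momentum * buf_tf + lr * g / denom  # TF: lr folded into the buffer
    p_tf -= buf_tf
    buf_pt = momentum * buf_pt + g / denom       # PyTorch: lr applied outside
    p_pt -= lr * buf_pt
    assert abs(p_tf - p_pt) < 1e-12              # equal only while lr stays fixed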