I think this is a follow-up discussion of https://github.com/pytorch/pytorch/issues/23796.
It’s a bit of a cliché, but while reproducing a TensorFlow repo in PyTorch, RMSprop again failed to align. Apart from the eps inside/outside the square root problem, I spotted two other differences, which I think should be open to discussion. (I could be wrong since I know little about all those optimization tricks.)
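- In TensorFlow’s implementation, lr is folded into the momentum buffer as each gradient is accumulated, so the buffer is not multiplied by lr again when it is applied. PyTorch instead accumulates the buffer without lr and multiplies the whole buffer by lr at the parameter update. To align with TensorFlow, I think RMSprop should be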
buf.mul_(group['momentum']).addcmul_(grad, group['lr'] / avg)
p.data.add_(-buf)
- Weight decay has to be multiplied by 2 in PyTorch to match. As far as I understand, in TensorFlow/Keras the L2 regularization loss is not multiplied by 1/2, so its derivative carries a factor of 2. PyTorch’s weight_decay adds weight_decay times the parameter to the gradient, so this factor of 2 is missing.
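
To make the first point concrete, here is a minimal scalar sketch in plain Python (not the actual optimizer code of either library). The function `run` and its `lr_inside_buffer` flag are hypothetical names for illustration, and both variants deliberately use the same denominator so that only the placement of lr differs. Under these assumptions the two updates coincide when lr is constant and the buffer starts at zero, and drift apart once lr changes between steps, e.g. under a schedule.

```python
import math

def run(lrs, grads, lr_inside_buffer, alpha=0.9, momentum=0.9, eps=1e-8):
    """Scalar RMSprop-with-momentum sketch; returns the final parameter value."""
    p, square_avg, buf = 1.0, 0.0, 0.0
    for lr, g in zip(lrs, grads):
        # Running average of squared gradients (same for both variants).
        square_avg = alpha * square_avg + (1 - alpha) * g * g
        denom = math.sqrt(square_avg) + eps
        if lr_inside_buffer:
            # TensorFlow-style as described above: lr enters the buffer.
            buf = momentum * buf + lr * g / denom
            p -= buf
        else:
            # PyTorch-style as described above: lr scales the whole buffer.
            buf = momentum * buf + g / denom
            p -= lr * buf
    return p

grads = [0.5, -0.2, 0.3, 0.1]
constant_lrs = [0.01] * 4
decayed_lrs = [0.01, 0.01, 0.001, 0.001]  # lr drops by 10x after two steps

constant_lr_gap = abs(run(constant_lrs, grads, True) - run(constant_lrs, grads, False))
decayed_lr_gap = abs(run(decayed_lrs, grads, True) - run(decayed_lrs, grads, False))

print("gap with constant lr:", constant_lr_gap)
print("gap with decayed lr: ", decayed_lr_gap)
```

If this sketch is right, the momentum difference would only matter in practice when a learning-rate schedule is used, since old buffer contents are rescaled by the new lr in one variant but not in the other.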
Correct me if I’m wrong. I’m also interested in why these differences exist if they are by design.
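
On the weight decay point, a small numerical check in plain Python, assuming the penalty has the form `lam * w**2` (no 1/2 factor) and that weight decay adds `wd * w` to the gradient. The helper names are hypothetical and this is only a sketch of the arithmetic, not library code.

```python
def l2_penalty_no_half(w, lam):
    # Assumed form of the L2 penalty: lam * w^2, without the 1/2 factor.
    return lam * w * w

def numerical_grad(f, w, h=1e-6):
    # Central finite difference approximation of df/dw.
    return (f(w + h) - f(w - h)) / (2 * h)

def weight_decay_term(w, wd):
    # Gradient contribution when weight decay is added as wd * w.
    return wd * w

lam, w = 1e-4, 3.0
penalty_grad = numerical_grad(lambda x: l2_penalty_no_half(x, lam), w)

same_coefficient = weight_decay_term(w, lam)         # wd = lam
doubled_coefficient = weight_decay_term(w, 2 * lam)  # wd = 2 * lam

print("gradient of penalty:", penalty_grad)
print("wd = lam:           ", same_coefficient)
print("wd = 2 * lam:       ", doubled_coefficient)
```

Under these assumptions, setting the weight decay to twice the L2 coefficient reproduces the penalty gradient, while reusing the same coefficient gives half of it.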