In the SGD optimizer code,
```python
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        d_p = buf.mul_(momentum).add_(1 - dampening, d_p)

p.data.add_(-group['lr'], d_p)
```
It seems that the update rule in PyTorch is something like this:
```
v = momentum * v + (1 - dampening) * dp
p = p - lr * v
```
It feels odd to me that the learning rate multiplies both terms (the momentum buffer and the gradient). Instead, I would expect:
```
v = momentum * v + (1 - dampening) * lr * dp
p = p - v
```
I just want to know whether this is a bug or intended.
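To make the difference concrete, here is a tiny sketch I put together comparing the two update rules on a 1-D quadratic (the names `pytorch_style_step` and `expected_step` are just my own labels, not anything from the library). If I'm reasoning correctly, with a constant learning rate the two rules trace the same parameter trajectory, since the velocities only differ by a factor of `lr`; they diverge once the learning rate changes during training.

```python
def pytorch_style_step(p, v, dp, lr, momentum, dampening):
    # v = momentum * v + (1 - dampening) * dp ;  p = p - lr * v
    v = momentum * v + (1 - dampening) * dp
    return p - lr * v, v

def expected_step(p, v, dp, lr, momentum, dampening):
    # v = momentum * v + (1 - dampening) * lr * dp ;  p = p - v
    v = momentum * v + (1 - dampening) * lr * dp
    return p - v, v

def run(step, lrs, momentum=0.9, dampening=0.0):
    p, v = 1.0, 0.0
    for lr in lrs:
        dp = 2 * p                    # gradient of f(p) = p**2
        p, v = step(p, v, dp, lr, momentum, dampening)
    return p

# Constant lr: both rules give the same final parameter.
print(run(pytorch_style_step, [0.1] * 5), run(expected_step, [0.1] * 5))
# Decaying lr: the trajectories differ, which is where the two definitions disagree.
print(run(pytorch_style_step, [0.1, 0.05, 0.025]), run(expected_step, [0.1, 0.05, 0.025]))
```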
My second question is that in-place operations are used everywhere in the optim code, yet the docs I read say in-place operations are rarely beneficial.
So is it fine to write it like this: `p = p - self.lr * dp`? I wonder whether there are extra benefits to in-place operations when writing optimizers.
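To illustrate what I mean, here is a minimal sketch (not the real torch.optim.SGD code; the variable names are just for illustration). The in-place version mutates the parameter tensor the model already holds, while `p = p - lr * d_p` creates a brand-new tensor and only rebinds the local name, so nothing outside the function sees the update.

```python
import torch

lr = 0.1

# In-place update: the model's own tensor changes.
p = torch.ones(3, requires_grad=True)
p.grad = torch.full((3,), 2.0)        # pretend backward() already ran
with torch.no_grad():
    d_p = p.grad
    p.add_(d_p, alpha=-lr)            # modifies p itself

# Out-of-place update: q is untouched, only q_new holds the result.
q = torch.ones(3, requires_grad=True)
q.grad = torch.full((3,), 2.0)
with torch.no_grad():
    d_q = q.grad
    q_new = q - lr * d_q              # new tensor; q keeps its old values

print(p)          # updated in place
print(q, q_new)   # q is still all ones
```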