In the SGD optimizer code,
```python
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        d_p = buf.mul_(momentum).add_(1 - dampening, d_p)

p.data.add_(-group['lr'], d_p)
```
It seems that the update rule in PyTorch is something like this:
```
v = momentum * v + (1 - dampening) * dp
p = p - lr * v
```
It feels odd to me that the learning rate multiplies both terms (the momentum buffer and the gradient). Instead, I would expect:
```
v = momentum * v + (1 - dampening) * lr * dp
p = p - v
```
I just want to know whether this is a bug or intended.
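To make the difference concrete, here is a tiny sketch I put together comparing the two update rules on a 1-D quadratic (the names `pytorch_style_step` and `expected_step` are just my own labels, not anything from the library). If I'm reasoning correctly, with a constant learning rate the two rules trace the same parameter trajectory, since the velocities only differ by a factor of `lr`; they diverge once the learning rate changes during training.

```python
def pytorch_style_step(p, v, dp, lr, momentum, dampening):
    # v = momentum * v + (1 - dampening) * dp ;  p = p - lr * v
    v = momentum * v + (1 - dampening) * dp
    return p - lr * v, v

def expected_step(p, v, dp, lr, momentum, dampening):
    # v = momentum * v + (1 - dampening) * lr * dp ;  p = p - v
    v = momentum * v + (1 - dampening) * lr * dp
    return p - v, v

def run(step, lrs, momentum=0.9, dampening=0.0):
    p, v = 1.0, 0.0
    for lr in lrs:
        dp = 2 * p                    # gradient of f(p) = p**2
        p, v = step(p, v, dp, lr, momentum, dampening)
    return p

# Constant lr: both rules give the same final parameter.
print(run(pytorch_style_step, [0.1] * 5), run(expected_step, [0.1] * 5))
# Decaying lr: the trajectories differ, which is where the two definitions disagree.
print(run(pytorch_style_step, [0.1, 0.05, 0.025]), run(expected_step, [0.1, 0.05, 0.025]))
```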
My second question is that in-place operations are used everywhere in the optim code, yet the docs I read say in-place operations are rarely beneficial.
So is it fine to write it like this: `p = p - self.lr * dp`? I wonder whether there are extra benefits to in-place operations when writing optimizers.
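To illustrate what I mean, here is a minimal sketch (not the real torch.optim.SGD code; the variable names are just for illustration). The in-place version mutates the parameter tensor the model already holds, while `p = p - lr * d_p` creates a brand-new tensor and only rebinds the local name, so nothing outside the function sees the update.

```python
import torch

lr = 0.1

# In-place update: the model's own tensor changes.
p = torch.ones(3, requires_grad=True)
p.grad = torch.full((3,), 2.0)        # pretend backward() already ran
with torch.no_grad():
    d_p = p.grad
    p.add_(d_p, alpha=-lr)            # modifies p itself

# Out-of-place update: q is untouched, only q_new holds the result.
q = torch.ones(3, requires_grad=True)
q.grad = torch.full((3,), 2.0)
with torch.no_grad():
    d_q = q.grad
    q_new = q - lr * d_q              # new tensor; q keeps its old values

print(p)          # updated in place
print(q, q_new)   # q is still all ones
```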