In the SGD optimizer code,
```python
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        d_p = buf.mul_(momentum).add_(1 - dampening, d_p)

p.data.add_(-group['lr'], d_p)
```
It seems that the update rule in PyTorch is something like this:
```
v = momentum * v + (1 - dampening) * dp
p = p - lr * v
```
It feels odd to me that the learning rate multiplies both terms (the momentum buffer and the gradient). Instead, I would expect:
```
v = momentum * v + (1 - dampening) * lr * dp
p = p - v
```
I just want to know whether this is a bug or intended.
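To make the difference concrete, here is a tiny sketch I put together comparing the two update rules on a 1-D quadratic (the names `pytorch_style_step` and `expected_step` are just my own labels, not anything from the library). If I'm reasoning correctly, with a constant learning rate the two rules trace the same parameter trajectory, since the velocities only differ by a factor of `lr`; they diverge once the learning rate changes during training.

```python
def pytorch_style_step(p, v, dp, lr, momentum, dampening):
    # v = momentum * v + (1 - dampening) * dp ;  p = p - lr * v
    v = momentum * v + (1 - dampening) * dp
    return p - lr * v, v

def expected_step(p, v, dp, lr, momentum, dampening):
    # v = momentum * v + (1 - dampening) * lr * dp ;  p = p - v
    v = momentum * v + (1 - dampening) * lr * dp
    return p - v, v

def run(step, lrs, momentum=0.9, dampening=0.0):
    p, v = 1.0, 0.0
    for lr in lrs:
        dp = 2 * p                    # gradient of f(p) = p**2
        p, v = step(p, v, dp, lr, momentum, dampening)
    return p

# Constant lr: both rules give the same final parameter.
print(run(pytorch_style_step, [0.1] * 5), run(expected_step, [0.1] * 5))
# Decaying lr: the trajectories differ, which is where the two definitions disagree.
print(run(pytorch_style_step, [0.1, 0.05, 0.025]), run(expected_step, [0.1, 0.05, 0.025]))
```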
My second question is that in-place operations are used everywhere in the optim code, yet the docs I read say in-place operations are rarely beneficial.
So is it fine to write it like this: `p = p - self.lr * dp`? I wonder whether there are extra benefits to in-place operations when writing optimizers.
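To illustrate what I mean, here is a minimal sketch (not the real torch.optim.SGD code; the variable names are just for illustration). The in-place version mutates the parameter tensor the model already holds, while `p = p - lr * d_p` creates a brand-new tensor and only rebinds the local name, so nothing outside the function sees the update.

```python
import torch

lr = 0.1

# In-place update: the model's own tensor changes.
p = torch.ones(3, requires_grad=True)
p.grad = torch.full((3,), 2.0)        # pretend backward() already ran
with torch.no_grad():
    d_p = p.grad
    p.add_(d_p, alpha=-lr)            # modifies p itself

# Out-of-place update: q is untouched, only q_new holds the result.
q = torch.ones(3, requires_grad=True)
q.grad = torch.full((3,), 2.0)
with torch.no_grad():
    d_q = q.grad
    q_new = q - lr * d_q              # new tensor; q keeps its old values

print(p)          # updated in place
print(q, q_new)   # q is still all ones
```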