Adam implementation

jingweiz · November 16, 2017, 2:28pm

Hey,
I was just looking at the Adam implementation http://pytorch.org/docs/master/_modules/torch/optim/adam.html and found that the current version is a bit different from the original one in the paper:
The current implementation:

    # Decay the first and second moment running average coefficient
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
    exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

    denom = exp_avg_sq.sqrt().add_(group['eps'])

    bias_correction1 = 1 - beta1 ** state['step']
    bias_correction2 = 1 - beta2 ** state['step']
    step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

    p.data.addcdiv_(-step_size, exp_avg, denom)

Here denom is computed before the bias-correction terms are applied, which means the final multiplier of the learning rate changes from

(m_t / (1 - \beta_1^t)) / (sqrt(v_t / (1 - \beta_2^t)) + \epsilon)

(in the paper) to

(m_t / (1 - \beta_1^t)) / ((sqrt(v_t) + \epsilon) / sqrt(1 - \beta_2^t))

The difference is minor, but is there a reason to implement it this way?
Thanks in advance!
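
To get a feel for how small the gap is, here is a minimal sketch that evaluates both multipliers side by side. The function names `paper_multiplier` and `rearranged_multiplier`, the constant gradient `g = 0.1`, and the default hyperparameters are assumptions made for illustration; none of this is taken from the PyTorch source.

```python
import math


def paper_multiplier(m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Multiplier of the learning rate as in the paper:
    eps is added after v has been bias-corrected."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)


def rearranged_multiplier(m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Multiplier of the learning rate as in the quoted snippet:
    eps is added to sqrt(v) before the bias correction is applied."""
    scale = math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return scale * m / (math.sqrt(v) + eps)


# Hypothetical setup: a constant gradient g, for which the running
# averages have the closed forms m_t = (1 - beta1^t) * g and
# v_t = (1 - beta2^t) * g^2.
g = 0.1
for t in (1, 10, 1000):
    m = (1 - 0.9 ** t) * g
    v = (1 - 0.999 ** t) * g * g
    a = paper_multiplier(m, v, t)
    b = rearranged_multiplier(m, v, t)
    print(t, a, b, abs(a - b) / abs(a))
```

The last printed column is the relative gap between the two forms. It should be largest at small t, where the bias correction on v is strongest, and shrink as t grows.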

onlytailei · November 16, 2017, 2:53pm

Yes, I’m also confused about this part.
The TensorFlow implementation is the same as the one in the original paper:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
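
The pseudocode above can be turned into a small runnable loop. This is only a sketch for a single scalar variable: the function name `adam_minimize`, the quadratic test objective, and the step count are assumptions chosen for illustration, not TensorFlow code.

```python
import math


def adam_minimize(grad_fn, x0, steps=2000, learning_rate=0.01,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Scalar Adam loop that follows the pseudocode line by line."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # lr_t folds both bias corrections into the step size
        lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # epsilon is added to sqrt(v_t) directly, with no correction on v_t
        x = x - lr_t * m / (math.sqrt(v) + epsilon)
    return x


# Hypothetical objective f(x) = (x - 3)^2 with gradient 2 * (x - 3)
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
print(x_min)
```

Note that the loop never forms the bias-corrected moments explicitly; the corrections only appear inside `lr_t`.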

apaszke · November 16, 2017, 8:18pm

I don’t fully understand what you meant. There are no divisions in the code that would lead to computing the denominator. Did you want to write out the full update rule?

jingweiz · November 17, 2017, 6:29pm

Hey,
I revised my question to try to make it clearer. Could you take a look?

onlytailei · November 20, 2017, 5:47pm

Would you mind checking it?

anand.saha · November 23, 2017, 6:52am

The paper itself includes a rearranged version of Adam.

It is mentioned in the last paragraph of Section 2, “Algorithm”.

“Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines…”

The same is implemented in Torch/PyTorch.
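
As a sketch of why the reordering is nearly equivalent, the update can be rewritten using the symbols from the snippets earlier in this thread. Here \alpha stands for the learning rate and \theta for the parameter; this is a derivation from the formulas in the thread, not a quotation from the paper.

```latex
\begin{align*}
\theta_t
  &= \theta_{t-1} - \alpha \,
     \frac{m_t / (1 - \beta_1^t)}{\sqrt{v_t / (1 - \beta_2^t)} + \epsilon} \\
  &= \theta_{t-1} - \alpha \,
     \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot
     \frac{m_t}{\sqrt{v_t} + \epsilon \sqrt{1 - \beta_2^t}}
\end{align*}
```

The second line has the same shape as the step_size and denom computation in the code, except that the code adds a constant \epsilon where the exact rewrite has \epsilon \sqrt{1 - \beta_2^t}. Since that factor tends to 1 as t grows, the two versions differ only in the effective size of \epsilon during early steps.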


jingweiz · November 23, 2017, 10:36am

Ah great! Thanks a lot for the pointer!

dgriff · November 23, 2017, 12:48pm

FYI, that is also the implementation TensorFlow uses for Adam.