Hey,
I was just looking at the Adam implementation http://pytorch.org/docs/master/_modules/torch/optim/adam.html and found that the current version is a bit different from the original one in the paper:
The current implementation:

# Decay the first and second moment running average coefficient
exp_avg.mul_(beta1).add_(1 - beta1, grad)
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

denom = exp_avg_sq.sqrt().add_(group['eps'])

bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']
step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

p.data.addcdiv_(-step_size, exp_avg, denom)

The denom here is computed before the bias correction terms are applied, which means the final multiplier of the learning rate changes from

(m_t / (1 - {\beta_1}^t)) / (sqrt(v_t / (1 - {\beta_2}^t)) + \epsilon)

(in the paper) to

(m_t / (1 - {\beta_1}^t)) / ((sqrt(v_t) + \epsilon) / sqrt(1 - {\beta_2}^t))

The difference is minor, but is there a reason to implement it this way?
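To see how small the difference is, here is a quick numeric sketch (not from the thread; the hyperparameters, step count, and moment values m_t, v_t are all made-up toy values) comparing the paper's multiplier with the one the rearranged computation effectively produces:

```python
import math

# Toy values, assumed for illustration only
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
t = 10
m_t, v_t = 0.05, 0.002  # first/second moment estimates at step t

# Paper: bias-correct first, then divide by (sqrt(v_hat) + eps)
m_hat = m_t / (1 - beta1 ** t)
v_hat = v_t / (1 - beta2 ** t)
paper = lr * m_hat / (math.sqrt(v_hat) + eps)

# Rearranged form: eps is added to sqrt(v_t) before the bias
# correction is folded into the step size
step_size = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
rearranged = step_size * m_t / (math.sqrt(v_t) + eps)

print(paper, rearranged)  # nearly identical; they differ only in how eps is scaled
```

The two values agree except that eps is effectively divided by sqrt(1 - beta2^t) in the rearranged form, which only matters when v_t is on the order of eps^2.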


The TensorFlow implementation is the same as in the original paper:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
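The pseudocode above folds the bias corrections into lr_t, so it can be checked against the paper's two-step form directly. A small sketch with assumed toy values (the two agree exactly when epsilon = 0, up to floating-point rounding):

```python
import math

# Assumed toy values for one update step
lr, beta1, beta2 = 0.01, 0.9, 0.999
g, t = 0.1, 5
m_prev, v_prev = 0.02, 0.001  # running moments from the previous step

# Moment updates, as in the pseudocode
m_t = beta1 * m_prev + (1 - beta1) * g
v_t = beta2 * v_prev + (1 - beta2) * g * g

# Rearranged form (pseudocode above), with epsilon = 0
lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
rearranged = lr_t * m_t / math.sqrt(v_t)

# Paper's form: bias-correct m_t and v_t first, then divide
m_hat = m_t / (1 - beta1 ** t)
v_hat = v_t / (1 - beta2 ** t)
paper = lr * m_hat / math.sqrt(v_hat)

print(abs(rearranged - paper))  # agrees up to floating-point rounding
```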

I don’t fully understand what you meant. There are no divisions in the code that would lead to computing the denominator. Did you want to write out the full update rule?

Hey,
I revised my question to try to make it clearer. Could you take a look?

Would you mind checking it?

There is a rearranged version of Adam in the paper.

It is mentioned in the last paragraph of Section 2 (Algorithm).

“Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines…”

The same is implemented in Torch/PyTorch.
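For reference, the replacement the paper sketches comes down to folding both bias corrections into a per-step learning rate (writing \hat{\epsilon} for the correspondingly rescaled epsilon):

\alpha_t = \alpha * sqrt(1 - {\beta_2}^t) / (1 - {\beta_1}^t)

\theta_t <- \theta_{t-1} - \alpha_t * m_t / (sqrt(v_t) + \hat{\epsilon})

which matches the step_size computation in the PyTorch snippet quoted at the top of the thread.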


Ah great! Thanks a lot for the pointer!

That is also the implementation they use for Adam in TensorFlow, FYI.