(m_t / (1 - \beta_1^t)) / \sqrt{(v_t + \epsilon) / (1 - \beta_2^t)}
The difference is minor, but is there a reason to implement it this way?
Thanks in advance!
I don’t fully understand what you meant. There are no divisions in the code that would lead to computing the denominator. Did you want to write out the full update rule?
There is a rearranged version of Adam in the paper.
It is mentioned in the last paragraph of Section 2 (Algorithm).
“Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines…”
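To make the comparison concrete, here is a rough sketch of the two orderings for a single scalar parameter. It is only an illustration and not the library's actual code: the function names and the sample values are made up, and `eps_hat` follows my reading of the paper's rearranged form, where the bias corrections are folded into the step size. As far as I can tell, the two forms give the same step only when `eps_hat = eps * sqrt(1 - beta2^t)`; with an unscaled epsilon they differ slightly.

```python
import math

def adam_step_standard(m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Step size using explicit bias-corrected moments (hypothetical helper)."""
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def adam_step_rearranged(m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps_hat=1e-8):
    """Step size with both bias corrections folded into the learning rate."""
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return lr_t * m / (math.sqrt(v) + eps_hat)

# Illustrative values only (not taken from any real training run).
m, v, t = 0.1, 0.01, 10
eps, beta2 = 1e-8, 0.999

standard = adam_step_standard(m, v, t, eps=eps, beta2=beta2)
# Scaling epsilon this way makes the two forms algebraically identical.
matched = adam_step_rearranged(m, v, t, eps_hat=eps * math.sqrt(1 - beta2 ** t), beta2=beta2)
# Reusing the same epsilon unscaled gives a slightly different step.
unscaled = adam_step_rearranged(m, v, t, eps_hat=eps, beta2=beta2)

print(standard, matched, unscaled)
```

So the rearranged version saves the two divisions on `m` and `v` per step, at the cost of epsilon meaning something slightly different.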