Suboptimal convergence when compared with TensorFlow model

At first glance, it looks like TensorFlow uses a slightly different definition of epsilon in its Adam implementation: the "epsilon hat" version.

They replace these three lines of the algorithm:

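$\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (compute bias-corrected first moment estimate)
$\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (compute bias-corrected second raw moment estimate)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update parameters)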

with these two lines:

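$\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$
$\theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t / (\sqrt{v_t} + \hat{\epsilon})$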

I also just took a glance at Keras, and it seems they do the same.

EDIT: Scratch that. We use the same formulation here as well, i.e. the bottom two lines.
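
For anyone comparing the two formulations numerically, here is a minimal sketch of a single scalar update in each form. The function names and the example values are hypothetical, chosen only for illustration; this is not code from either library. Algebraically, the two forms agree exactly when $\hat{\epsilon} = \epsilon \cdot \sqrt{1 - \beta_2^t}$; if the same numeric value is used for both epsilons, the results differ very slightly.

```python
import math

# Hypothetical helpers for illustration only; not taken from any library.

def adam_step_standard(theta, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Three-line form: bias-correct m and v, then update the parameter."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (math.sqrt(v_hat) + eps)

def adam_step_eps_hat(theta, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps_hat=1e-8):
    """Two-line form: fold the bias correction into the step size alpha_t."""
    alpha_t = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return theta - alpha_t * m / (math.sqrt(v) + eps_hat)

# Example: first step (t = 1) with an assumed gradient g = 0.5.
beta1, beta2, eps, t, g = 0.9, 0.999, 1e-8, 1, 0.5
m = (1 - beta1) * g        # first moment after one step, starting from 0
v = (1 - beta2) * g * g    # second raw moment after one step, starting from 0

step_a = adam_step_standard(1.0, m, v, t, eps=eps)
# Same numeric epsilon in the two-line form: close to step_a, but not identical.
step_b = adam_step_eps_hat(1.0, m, v, t, eps_hat=eps)
# Rescaled epsilon: matches step_a up to floating-point rounding.
step_c = adam_step_eps_hat(1.0, m, v, t, eps_hat=eps * math.sqrt(1 - beta2 ** t))

print(step_a, step_b, step_c)
```

The gap between `step_a` and `step_b` is tiny for this example, so on its own it seems unlikely to explain a large convergence difference, but it can matter when epsilon is large relative to the square root of the second moment.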
