At first glance, it looks like TensorFlow is using a slightly different definition of epsilon in their Adam. They are using the “epsilon hat” version.
They replace these three lines of the algorithm:
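$\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
$\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)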
with these two lines:
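$\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$
$\theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t / (\sqrt{v_t} + \hat{\epsilon})$.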
I also just took a glance at Keras, and it seems they are too.
EDIT: Scratch that. We are using the same thing here as well: we also use the bottom two lines.
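For anyone comparing the two forms, here is a rough sketch in plain Python of a single scalar update written both ways. The function names and the moment values are made up for illustration and are not taken from any library. Working through the algebra, the two forms agree exactly only when `eps_hat = eps * sqrt(1 - beta2**t)`; passing the same number for both gives a slightly different step.

```python
import math

def adam_step_three_line(theta, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Three-line form: bias-correct m and v first, then update."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (math.sqrt(v_hat) + eps)

def adam_step_eps_hat(theta, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps_hat=1e-8):
    """Two-line form: fold the bias correction into the step size alpha_t."""
    alpha_t = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return theta - alpha_t * m / (math.sqrt(v) + eps_hat)

# Hypothetical values for the parameter, the raw moments, and the step count.
theta, m, v, t = 1.0, 0.01, 1e-4, 10
eps, beta2 = 1e-8, 0.999

# Three-line form with epsilon.
a = adam_step_three_line(theta, m, v, t, eps=eps)
# Two-line form with the rescaled epsilon that makes it algebraically identical.
b = adam_step_eps_hat(theta, m, v, t, eps_hat=eps * math.sqrt(1 - beta2 ** t))
# Two-line form reusing the same numeric value for epsilon hat.
c = adam_step_eps_hat(theta, m, v, t, eps_hat=eps)

print(a, b, c)
```

With typical small epsilons the gap between `a` and `c` is tiny, but it grows if epsilon is set to something larger, so it is worth knowing which definition a given implementation uses.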