I read some papers about how the Adam optimizer works, and there are a few points that I find confusing.

The Adam equations are:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \tag{1}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \tag{2}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
1. In the second formula, the gradient is squared.

1.1 Why do we do that?
1.2 What do we gain from using the squared gradient?
1.3 Are there situations where the squared gradient (in the denominator) gives us worse performance?
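To state my current understanding of question 1, here is a toy sketch (plain NumPy, constant gradients, made-up values, not taken from any real training run): with the squared gradient in the denominator, a parameter with consistently large gradients and a parameter with consistently small gradients end up taking steps of roughly the same size.

```python
import numpy as np

# Toy sketch of the Adam update (constant gradients, made-up values).
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
g = np.array([10.0, 0.1])   # one large-gradient and one small-gradient parameter

m = np.zeros_like(g)
v = np.zeros_like(g)
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * g          # equation (1)
    v = beta2 * v + (1 - beta2) * g ** 2     # equation (2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)

print(step)  # both components are roughly lr = 0.001 despite the 100x gradient gap
```

Is this per-parameter normalization the main gain that the squaring buys (question 1.2)?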
2. I read that a disadvantage of the AdaGrad algorithm is that, as the number of iterations becomes very large, the effective learning rate shrinks to a very small value, which leads to slow convergence.

It seems to me that Adam has the same drawback, only to a lesser degree: the learning rate is divided by the term from the second equation (which involves the squared gradient). With many epochs and iterations, won't the effective learning rate also shrink to a very small value, so that we run into the same problem? Am I wrong?
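To make the comparison concrete, this is how I understand the two denominators (AdaGrad's accumulator written in the same notation as equation (2) above):

$$\text{AdaGrad: } v_t = \sum_{i=1}^{t} g_i^2, \qquad \text{Adam: } v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 .$$

AdaGrad's $v_t$ is a running sum, so $\alpha / (\sqrt{v_t} + \epsilon)$ can only keep shrinking as $t$ grows. Adam's $v_t$ is an exponential moving average of the recent squared gradients, and my question 2 is essentially whether that moving average also drives the effective step size toward zero over many iterations.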
3. Adam uses equations 1 and 2 to adapt the effective (per-parameter) learning rate. On the other hand, I have seen examples that use the OneCycleLR scheduler together with Adam.
3.1 If the two mechanisms (Adam and OneCycleLR) adjust the learning rate at the same time, could they collide with each other? Do they?
3.2 Is it wrong to use OneCycleLR with Adam?
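For reference, this is roughly the pattern I saw, written as a minimal PyTorch sketch (the model, data, and hyperparameter values are placeholders I made up, not from a specific example):

```python
import torch
from torch import nn

# Placeholder model and data, just to make the sketch runnable.
model = nn.Linear(10, 1)
train_loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]
epochs = 5

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,                        # peak learning rate of the one-cycle schedule
    steps_per_epoch=len(train_loader),
    epochs=epochs,
)

loss_fn = nn.MSELoss()
for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()       # Adam scales each parameter's step by 1/(sqrt(v_hat) + eps)
        scheduler.step()       # OneCycleLR rewrites the lr in optimizer.param_groups every step
```

So question 3.1, stated more precisely: the scheduler rewrites $\alpha$ every step while Adam divides by $\sqrt{\hat{v}_t}$; do these two adjustments interfere with each other, or do they operate on different things?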