I read some papers about how the Adam optimizer works, and there are a few points that I find confusing.

The Adam equations are:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \tag{1}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \tag{2}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
1. In the second formula, the gradient is squared.

1.1 Why do we do that?
1.2 What do we gain from using the squared gradient?
1.3 Are there situations where the squared gradient (in the denominator) gives us worse performance?
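To state my current understanding of question 1, here is a toy sketch (plain NumPy, constant gradients, made-up values, not taken from any real training run): with the squared gradient in the denominator, a parameter with consistently large gradients and a parameter with consistently small gradients end up taking steps of roughly the same size.

```python
import numpy as np

# Toy sketch of the Adam update (constant gradients, made-up values).
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
g = np.array([10.0, 0.1])   # one large-gradient and one small-gradient parameter

m = np.zeros_like(g)
v = np.zeros_like(g)
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * g          # equation (1)
    v = beta2 * v + (1 - beta2) * g ** 2     # equation (2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)

print(step)  # both components are roughly lr = 0.001 despite the 100x gradient gap
```

Is this per-parameter normalization the main gain that the squaring buys (question 1.2)?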
2. I read that a disadvantage of the AdaGrad algorithm is that, as the number of iterations becomes very large, the effective learning rate shrinks to a very small value, which leads to slow convergence.

It seems to me that Adam has the same drawback, only to a lesser degree: the learning rate is divided by the term from the second equation (which involves the squared gradient). With many epochs and iterations, won't the effective learning rate also shrink to a very small value, so that we run into the same problem? Am I wrong?
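To make the comparison concrete, this is how I understand the two denominators (AdaGrad's accumulator written in the same notation as equation (2) above):

$$\text{AdaGrad: } v_t = \sum_{i=1}^{t} g_i^2, \qquad \text{Adam: } v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 .$$

AdaGrad's $v_t$ is a running sum, so $\alpha / (\sqrt{v_t} + \epsilon)$ can only keep shrinking as $t$ grows. Adam's $v_t$ is an exponential moving average of the recent squared gradients, and my question 2 is essentially whether that moving average also drives the effective step size toward zero over many iterations.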
3. Adam uses equations 1 and 2 to adapt the effective (per-parameter) learning rate. On the other hand, I have seen examples that use the OneCycleLR scheduler together with Adam.
3.1 If the two mechanisms (Adam and OneCycleLR) adjust the learning rate at the same time, could they collide with each other? Do they?
3.2 Is it wrong to use OneCycleLR with Adam?
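For reference, this is roughly the pattern I saw, written as a minimal PyTorch sketch (the model, data, and hyperparameter values are placeholders I made up, not from a specific example):

```python
import torch
from torch import nn

# Placeholder model and data, just to make the sketch runnable.
model = nn.Linear(10, 1)
train_loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]
epochs = 5

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,                        # peak learning rate of the one-cycle schedule
    steps_per_epoch=len(train_loader),
    epochs=epochs,
)

loss_fn = nn.MSELoss()
for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()       # Adam scales each parameter's step by 1/(sqrt(v_hat) + eps)
        scheduler.step()       # OneCycleLR rewrites the lr in optimizer.param_groups every step
```

So question 3.1, stated more precisely: the scheduler rewrites $\alpha$ every step while Adam divides by $\sqrt{\hat{v}_t}$; do these two adjustments interfere with each other, or do they operate on different things?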