I read some papers about how the ADAM optimizer works, and a few points are still confusing to me:
The ADAM equations are:

(1) m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
(2) v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
(3) theta_t = theta_{t-1} - lr * m_hat_t / (sqrt(v_hat_t) + eps)

where m_hat_t = m_t / (1 - beta1^t) and v_hat_t = v_t / (1 - beta2^t) are the bias-corrected moments.
In the second formula, we square the gradient.
1.1 Why do we do that?
1.2 What do we gain from squaring the gradient?
1.3 Are there situations where the squared gradient (in the denominator) gives us worse performance?
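For concreteness, here is a minimal pure-Python sketch of one ADAM step for a single scalar parameter (the function name and default hyperparameters are my assumptions, chosen to match the commonly published defaults):

```python
import math

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a single scalar parameter (a minimal sketch)."""
    m = beta1 * m + (1 - beta1) * g       # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g   # second moment: EMA of SQUARED gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Note what the squared gradient buys: at t = 1 the step is roughly lr * g / |g|, i.e. the denominator normalizes the step by the recent gradient magnitude, so consistently large gradients do not produce proportionally large steps.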
I read that the disadvantage of the AdaGrad algorithm is that, as the number of iterations becomes very large, the learning rate decreases to a very small number, which leads to slow convergence.
It seems to me that the ADAM algorithm has the same drawback, just to a lesser degree: we divide the learning rate by a quantity built from the second equation (which accumulates squared gradients). With many epochs and iterations, won't the effective learning rate also shrink to a very small number, so that we hit the same drawback?
- Am I wrong?
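A quick numeric check (assuming a constant gradient of 1 purely for illustration) shows why the two denominators behave differently: AdaGrad sums squared gradients without bound, while ADAM keeps an exponential moving average that saturates:

```python
beta2 = 0.999        # ADAM's default second-moment decay
g = 1.0              # constant gradient, purely for illustration
adagrad_acc = 0.0    # AdaGrad: running SUM of squared gradients
adam_v = 0.0         # ADAM: exponential moving AVERAGE of squared gradients
for _ in range(10_000):
    adagrad_acc += g * g
    adam_v = beta2 * adam_v + (1 - beta2) * g * g

print(adagrad_acc ** 0.5)  # 100.0 -> AdaGrad's effective step has shrunk ~100x
print(adam_v ** 0.5)       # ~1.0  -> ADAM's effective step stays on the order of lr
```

So ADAM's denominator tracks the recent gradient magnitude rather than the whole history: the effective step does not decay to zero just because t grows large; it shrinks only if the gradients themselves shrink.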
ADAM uses equations 1 and 2 to adapt the effective learning rate.
On the other hand, I saw some examples that use the OneCycleLR scheduler with ADAM.
3.1 If these two mechanisms (ADAM's adaptive scaling and OneCycleLR) tune the learning rate at the same time, might they collide with each other? Do they?
3.2 Is it wrong to use OneCycleLR with ADAM?
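To make the interaction concrete, here is a simplified pure-Python sketch of a one-cycle schedule (not PyTorch's exact implementation; the cosine shape, `pct_start`, and the `div_factor`-style default are my assumptions modeled on PyTorch's documented parameters). The key observation is that the scheduler only sets the global step size lr in the update rule, while ADAM's sqrt(v_hat) denominator rescales each parameter individually, so the two compose rather than fight:

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0):
    """Cosine one-cycle schedule: warm up to max_lr, then anneal back down.
    A simplified sketch, NOT the exact torch.optim.lr_scheduler.OneCycleLR."""
    base_lr = max_lr / div_factor
    warmup = int(total_steps * pct_start)
    if step < warmup:  # warm-up phase: base_lr -> max_lr
        frac = step / max(warmup, 1)
        return base_lr + (max_lr - base_lr) * 0.5 * (1 - math.cos(math.pi * frac))
    # annealing phase: max_lr -> 0
    frac = (step - warmup) / max(total_steps - warmup, 1)
    return max_lr * 0.5 * (1 + math.cos(math.pi * frac))
```

In training-loop terms: the scheduler changes only this global lr each iteration, and inside the optimizer step ADAM still divides by sqrt(v_hat_t) per parameter. The schedule shapes the overall step magnitude over the run, while ADAM handles relative per-parameter scaling, which is why they are commonly used together.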