Question about learning rate

If the gradients range from 1e-5 to 1e-4, what order of magnitude should the learning rate be?
Thanks a lot.

I don't know; I think the optimal learning rate depends on other aspects of the problem besides gradient scale. That said, SGD and other optimizers like Adam use the learning rate in different ways. Adam automatically adapts the step size to the gradient scale on a per-parameter basis; its learning-rate hyperparameter effectively sets the maximum any parameter can change per step. Adam's default learning rate of 1e-3 is often "good enough" for most problems, and people don't tune it further. But try and see!
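To make that concrete, here's a minimal sketch in plain Python (not any library's optimizer, just the standard Adam update rule applied to a single scalar; `sgd_step` and `adam_step` are hypothetical helpers for illustration). With a constant gradient, Adam's update converges to roughly the learning rate regardless of the gradient's magnitude, while SGD's update is just `lr * grad`:

```python
def sgd_step(grad, lr):
    # Plain SGD: the update scales directly with the gradient.
    return lr * grad

def adam_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    # Run the standard Adam update with a constant gradient so the
    # moment estimates warm up; return the size of the final update.
    m = v = 0.0
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        update = lr * m_hat / (v_hat ** 0.5 + eps)
    return update

for g in (1e-5, 1e-4):
    print(f"grad={g:.0e}  SGD step={sgd_step(g, 1e-3):.1e}  Adam step={adam_step(g, 1e-3):.1e}")
```

The SGD steps shrink with the gradient (about 1e-8 and 1e-7 here), while Adam's stay near the learning rate (about 1e-3 for both), which is why the learning rate acts as a per-parameter cap on how much each step can move.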


Thank you very much. I've got the idea.