I am currently training a model whose loss explodes if I use a higher learning rate, so the highest learning rate I can use is about 1e-3. Above that, the loss even goes to NaN after the first iteration, which was a bit surprising to me.

I thought I had a bug, but the model does converge after a long time and I get meaningful predictions from it.

I am currently not using any regularization and no weight decay.

I was wondering if there is any advice on what I could try in order to push the learning rate up.

A high learning rate does not guarantee fast training.

Anyway, going to NaN is not good at all. Does your model normalize its features, with BatchNorm or something similar? Activation functions are also needed in most cases.

Just to be clear, this all depends on the details of your model and data.

Sometimes if your training is initially unstable (your "loss explodes") it
can be worth training with a small learning rate for a "warm-up" period
of 10 or 100 iterations or epochs and then seeing if you can increase the
learning rate.

The intuition is that your randomly-initialized weights start out, by
happenstance, on the side of a steep "hill," so gradient descent has
you taking excessively large steps that just move you to other bad
places in configuration space, jumping across "valleys" and such.

If you train with a small learning rate for a while, you move downhill
into the "valley" where the weights are more sensible, the gradients
are not so large, and you can safely increase your learning rate.
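As a rough illustration of the warm-up idea, here is a minimal sketch of a linear warm-up multiplier. The function name `warmup_factor`, the 100-step warm-up length, and the target rate of 1e-2 are all assumptions chosen for the example, not values taken from the question.

```python
def warmup_factor(step: int, warmup_steps: int = 100) -> float:
    """Multiplier applied to the target learning rate at a given step.

    Ramps linearly from 1/warmup_steps up to 1.0, then stays at 1.0.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)


target_lr = 1e-2  # hypothetical rate that was unstable from a cold start
for step in (0, 49, 99, 500):
    # Early steps use a small fraction of target_lr; later steps use all of it.
    print(step, target_lr * warmup_factor(step))
```

If you are using PyTorch, a function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor.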

(Note, it is also often recommended, whether or not you start out
with a smaller warm-up learning rate, to train for a while with a larger
learning rate so that you minimize your loss with respect to the coarse
topography of the loss landscape, and then train further with a
smaller learning rate so that you can minimize your loss within the
finer features of the loss landscape. In general, this is what learning-rate
schedulers are about.)
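One simple shape such a schedule can take is a step decay. The sketch below is only an illustration: the name `step_decay_lr` and the defaults (start at 0.1, divide by 10 every 30 epochs) are assumptions, and the right values depend on your model and data.

```python
def step_decay_lr(epoch: int, base_lr: float = 0.1,
                  drop_every: int = 30, gamma: float = 0.1) -> float:
    """Learning rate that is multiplied by gamma once every drop_every epochs."""
    return base_lr * gamma ** (epoch // drop_every)


for epoch in (0, 29, 30, 65):
    # Large steps early (coarse topography), smaller steps later (fine features).
    print(epoch, step_decay_lr(epoch))
```

In PyTorch, `torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma)` applies the same kind of rule to an optimizer's learning rate.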

Although using "momentum" is not the same as increasing the learning
rate, it does have some similarities. Try using momentum with your
optimizer and see if your training progresses more quickly. You could
also experiment with the Adam ("adaptive moment estimation") optimizer. It
can be less stable than plain gradient descent but in many problems
it can speed up training dramatically. (You will need to adjust your
learning rate if you add momentum or switch to Adam because the
learning rate and momentum terms interact with one another.)
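To make the momentum comparison concrete, here is a small sketch on a toy problem: minimizing f(w) = 0.5 * curvature * w**2 with and without momentum at the same learning rate. The function `descend` and all of its defaults are made up for the illustration, and a one-dimensional quadratic is of course far simpler than a real loss landscape. Adam is not shown here.

```python
def descend(lr: float, momentum: float, steps: int = 100,
            curvature: float = 1.0, w0: float = 1.0) -> float:
    """Minimize f(w) = 0.5 * curvature * w**2 and return the final w."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = curvature * w                   # df/dw at the current w
        velocity = momentum * velocity + grad  # decayed accumulation of past gradients
        w -= lr * velocity                     # step along the accumulated direction
    return w


plain = descend(lr=0.01, momentum=0.0)
with_momentum = descend(lr=0.01, momentum=0.9)
# The momentum run ends much closer to the minimum at w = 0.
print(abs(plain), abs(with_momentum))
```

With momentum 0.9 and a steady gradient, the accumulated step is roughly 1 / (1 - 0.9) = 10 times larger than the plain step, which is one way to see why the learning rate usually has to be re-tuned when momentum is added. In PyTorch the corresponding optimizers are `torch.optim.SGD(params, lr=..., momentum=0.9)` and `torch.optim.Adam(params, lr=...)`.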

If it's not too hard for you to modify your model, you could try adding
BatchNorm to some of your layers. I haven't experimented with this
myself, but some of the lore says that using BatchNorm will let you
increase your learning rate.
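For reference, adding BatchNorm to a small fully connected model in PyTorch might look like the sketch below. The layer sizes, the batch size, and the deliberately badly scaled input are all assumptions for the example; nothing here is taken from the original poster's model.

```python
import torch
from torch import nn

# Hypothetical MLP: 20 input features, 64 hidden units, 1 output.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # normalizes each hidden feature over the batch
    nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(32, 20) * 50.0  # badly scaled inputs, batch of 32
model.train()                   # BatchNorm uses batch statistics in train mode
hidden = model[1](model[0](x))  # activations right after the BatchNorm layer
print(hidden.mean().item(), hidden.std().item())  # close to 0 and 1
out = model(x)
print(out.shape)
```

Even though the inputs are scaled by 50, the activations after the BatchNorm layer have roughly zero mean and unit standard deviation. That may be part of why BatchNorm is said to tolerate larger learning rates, but whether it helps in your case is something to test.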