Is learning rate decay a regularization technique?

To my understanding, it is a regularization technique, because it helps the model learn correctly and generalize. But I am still confused about whether it would be correct to call it a “regularization” method.

Thank you!

Hi Sahil!

I would not call learning-rate decay regularization.

Of course, we’re discussing semantics – the meaning of the term
“regularization.” But I think it has a fairly standard definition – or at
least connotation – in this context.

We have a loss function as a function of parameter space, and an
optimization algorithm that seeks to move around in parameter space
to locations with lower losses. (In practice, with neural networks, we
neither find the global minimum, nor even a local minimum. We just
want to find a “good enough” location in parameter space with a “low
enough” value of the loss.)

Learning-rate decay is just part of the optimization algorithm, just as
is, say, adding momentum to gradient descent or fancier things like
adaptive momentum. These don’t change where in parameter space
you want to go – they just try to get you there faster.
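For concreteness, here is a minimal PyTorch sketch (toy model, made-up data and hyperparameters) using the StepLR scheduler: the scheduler only rescales the step size over time, while the loss being minimized never changes.

```python
import torch

# Placeholder model and data, just to illustrate the mechanics.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 10 epochs; the loss surface is untouched.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

for epoch in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # same loss function every epoch
    loss.backward()
    optimizer.step()
    scheduler.step()              # only the step size (lr) changes
```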

Regularization, however, changes the loss function so that different
regions of parameter space are favored. For example, weight decay,
which we often think of as being part of the optimization algorithm, is
essentially equivalent to adding an L2 regularizer to the loss function.
This means we prefer weights that are smaller, even if they would lead
to a (modestly) larger unregularized loss. This can be used to prevent
weights from becoming very large and producing numerical instability.
It can also lead to weights that generalize better. But, to emphasize,
it changes which points in parameter space are favored (by the
regularized loss) relative to other points.
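As a rough sketch of that equivalence (toy model, made-up hyperparameters; the exact correspondence holds for plain SGD, up to the conventional factor of 1/2), weight decay inside the optimizer and an explicit L2 term added to the loss favor the same smaller-weight regions of parameter space:

```python
import torch

# Placeholder model and data to illustrate the equivalence.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

# Option A: weight decay handled inside the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# Option B: the same preference for small weights written explicitly as an
# L2 regularizer added to the loss. lambda_l2 plays the role of weight_decay;
# the factor of 1/2 matches SGD's added gradient of weight_decay * p.
lambda_l2 = 1e-4
loss = loss_fn(model(x), y)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
regularized_loss = loss + 0.5 * lambda_l2 * l2_penalty
```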

Similarly, one can add sparsity penalties (e.g., L1 regularizers) to the
loss function to prefer having more parameters that are close to zero.
Again, in many instances, such sparser parameter sets generalize
better, even though they have larger unregularized losses computed
for the training data.
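A minimal sketch of such a penalty (toy model, made-up penalty strength): an L1 term added to the loss that pushes parameters towards zero.

```python
import torch

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

lambda_l1 = 1e-3  # made-up penalty strength
loss = loss_fn(model(x), y)
# The L1 term favors parameter sets with many near-zero entries,
# even at the cost of a somewhat larger unregularized loss.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + lambda_l1 * l1_penalty).backward()
```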

To summarize: Regularization modifies the loss, changing which regions
of parameter space are favored, while things like learning-rate decay
modify the optimization algorithm so that it moves you to the favored
regions of parameter space more efficiently.

Best.

K. Frank


Thank you so much for your valuable feedback.

Would you then consider ReduceLROnPlateau and EarlyStopping to be regularization, as this blog “Prevent Overfitting of Neural Networks -” does?

Thank you!

Hi Sahil!

Again, it’s a question of semantics, but I would not call techniques such
as early stopping and the algorithm in pytorch’s ReduceLROnPlateau
kinds of regularization. They are overlays on top of the optimization
algorithm that take a peek at some validation results. But they don’t
in any substantive sense affect the regions of parameter space towards which
the optimization algorithm moves you.
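To illustrate (a minimal sketch with a toy model and made-up settings), ReduceLROnPlateau watches a validation metric and shrinks the learning rate when that metric stalls, but the loss surface the optimizer descends is the same throughout:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Shrink the learning rate by 10x if the validation loss stops improving
# for 5 epochs; the loss function itself is never modified.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)
x_val, y_val = torch.randn(8, 10), torch.randn(8, 1)

for epoch in range(20):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val)
    scheduler.step(val_loss)  # adjusts lr only; same loss surface
```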

That is, in my language, the weights in the network are not being
regularized in any way.

Regularization can be used to reduce overfitting, but I would not say
that anything that reduces overfitting is regularization. For example,
training on more data (or sometimes on augmented data) will reduce
overfitting (and will move you towards different regions of parameter
space), but I would not call training on more data “regularization.”

Best.

K. Frank
