From my understanding, it is a regularization technique, because it helps the model learn correctly and generalize better. But I am still confused about whether it would be correct to call it a “regularization” method.

Thank you!

Hi Sahil!

I would not call learning-rate decay *regularization.*

Of course, we’re discussing semantics – the meaning of the term “regularization.” But I think it has a fairly standard definition – or at least connotation – in this context.

We have a loss function as a function of parameter space, and an optimization algorithm that seeks to move around in parameter space to locations with lower losses. (In practice, with neural networks, we neither find the global minimum, nor even a local minimum. We just want to find a “good enough” location in parameter space with a “low enough” value of the loss.)

Learning-rate decay is just part of the optimization algorithm, just as is, say, adding momentum to gradient descent or fancier things like adaptive momentum. These don’t change where in parameter space you want to go – they just try to get you there faster.
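As a toy sketch of that point (plain Python, made-up numbers): exponential learning-rate decay changes how big the steps are over time, but the loss being minimized – and hence its minimizer – stays exactly the same.

```python
def lr_at_step(base_lr, decay_rate, step):
    """Exponential learning-rate decay: shrinks the step size over time.
    The loss function itself is untouched -- only how fast we move."""
    return base_lr * (decay_rate ** step)

# gradient descent on f(w) = (w - 3)^2 with a decaying learning rate
w = 0.0
for step in range(200):
    grad = 2.0 * (w - 3.0)                 # d/dw of (w - 3)^2
    w -= lr_at_step(0.1, 0.99, step) * grad

# w still heads to the same minimizer (w = 3) that a fixed learning
# rate would find; the decay only affects the path taken to get there.
```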

*Regularization,* however, changes the loss function so that different regions of parameter space are favored. For example, *weight decay,* which we often think of as being part of the optimization algorithm, is essentially equivalent to adding an L2 regularizer to the loss function. This means we prefer weights that are smaller, even if they would lead to a (modestly) larger unregularized loss. This can be used to prevent weights from becoming very large and producing numerical instability. It can also lead to weights that generalize better. But, to emphasize, it changes which points in parameter space are favored (by the regularized loss) relative to other points.
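That equivalence can be seen in a minimal plain-Python sketch (hypothetical toy loss and constants): a plain-SGD step with weight decay folded into the update rule is the same as an SGD step on the loss plus an explicit L2 penalty.

```python
def grad_loss(w):
    # gradient of a toy unregularized loss f(w) = (w - 5)^2
    return 2.0 * (w - 5.0)

lr, wd = 0.1, 0.01

# (a) "weight decay" folded into the update rule
w_a = 1.0
w_a = w_a - lr * grad_loss(w_a) - lr * wd * w_a

# (b) explicit L2 regularizer: minimize f(w) + (wd / 2) * w**2,
#     whose gradient is grad_loss(w) + wd * w
w_b = 1.0
w_b = w_b - lr * (grad_loss(w_b) + wd * w_b)

# for plain SGD the two updates coincide exactly
assert abs(w_a - w_b) < 1e-12
```

(For optimizers with momentum or adaptive scaling, such as Adam, the two formulations are no longer identical, which is why decoupled weight decay exists as a separate variant.)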

Similarly, one can add sparsity penalties (e.g., L1 regularizers) to the loss function to prefer having more parameters that are close to zero. Again, in many instances, such sparser parameter sets generalize better, *even though* they have larger unregularized losses computed for the training data.
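To illustrate (a plain-Python sketch with made-up numbers): an L1 penalty shrinks every weight by a fixed amount per step, so small weights get driven exactly to zero – this is the soft-thresholding (proximal) step behind many sparse solvers.

```python
def soft_threshold(w, t):
    """Proximal step for an L1 penalty: shrink w toward zero by t,
    snapping it to exactly zero once |w| <= t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

weights = [0.02, -0.01, 0.8, -1.5]
l1_step = [soft_threshold(w, 0.05) for w in weights]
# small weights become exactly zero; large ones barely move:
# l1_step == [0.0, 0.0, 0.75, -1.45]
```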

To summarize: Regularization modifies the loss, changing which regions of parameter space are favored, while things like learning-rate decay modify the optimization algorithm so that it moves you to the favored regions of parameter space more efficiently.

Best.

K. Frank


Thank you so much for your valuable feedback.

Would you then consider `ReduceLROnPlateau` and early stopping to be regularization, as this blog “Prevent Overfitting of Neural Networks -” is doing?

Thank you!

Hi Sahil!

Again, it’s a question of semantics, but I would not call techniques such as early stopping and the algorithm in pytorch’s `ReduceLROnPlateau` kinds of regularization. They are overlays on top of the optimization algorithm that take a peek at some validation results. But they don’t in any substantive sense affect the regions in parameter space toward which the optimization algorithm moves you.

That is, in my language, the weights in the network are not being in any way regularized.
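A sketch of that “overlay” idea (plain Python, hypothetical validation losses – not pytorch’s actual implementation): both techniques just watch the validation loss and react to plateaus, without ever touching the loss that is being optimized.

```python
def monitor(val_losses, patience=2, lr=0.1, factor=0.5):
    """Toy overlay combining ReduceLROnPlateau-style LR cuts with
    early stopping. It only observes validation losses; the training
    loss -- and hence which regions of parameter space are favored --
    is left unchanged."""
    best = float("inf")
    bad_epochs = 0
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                return "stopped", lr      # early stopping
            lr *= factor                  # reduce LR on plateau
    return "finished", lr

# validation loss plateaus after the third epoch
result = monitor([1.0, 0.8, 0.7, 0.7, 0.7, 0.7])
# -> ("stopped", 0.025): LR was halved twice, then training stopped
```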

Regularization can be used to reduce overfitting, but I would not say that anything that reduces overfitting is regularization. For example, training on more data (or sometimes on augmented data) will reduce overfitting (and will move you towards different regions of parameter space), but I would not call training on more data “regularization.”

Best.

K. Frank
