About torch.nn.utils.clip_grad_norm

Diego · February 21, 2018, 3:51am

Hello, I am trying to understand what this function does. I know it is used to prevent exploding gradients in a model, and I understand what the norm of a vector is, so I’m guessing that this function ‘clips’ the norm of a vector to a specific maximum value. But I would like to know how this prevents the exploding gradient problem and what exactly it does to the model parameters. Help would be greatly appreciated. Thanks!

jpeg729 · February 21, 2018, 8:17am

In some cases you may find that each layer of your net amplifies the gradient it receives. This causes a problem because the lower layers of the net then get huge gradients and their updates will be far too large to allow the model to learn anything.

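This function ‘clips’ the norm of the gradients by scaling all of the gradients down by the same factor, so that their overall norm is reduced to an acceptable level (at most the max_norm you pass in). Gradients whose norm is already below that limit are left unchanged. In practice this places a limit on the size of the parameter updates.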

The hope is that this will ensure that your model gets reasonably sized gradients and that the corresponding updates will allow the model to learn.
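To make the "scale everything by the same factor" idea concrete, here is a minimal pure-Python sketch of clipping by the global norm. The helper name `clip_by_global_norm` and the small `eps` term are illustrative assumptions, not PyTorch's actual code, although the library function follows the same general idea.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Illustrative helper (hypothetical, not part of PyTorch).

    grads is a list of per-layer gradient lists. All values are scaled by one
    shared factor so that their joint L2 norm is at most max_norm.
    """
    # Norm of all gradients treated as one long vector.
    total_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    # The factor is capped at 1.0, so small gradients are never scaled up.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in layer] for layer in grads]
    return clipped, total_norm

# Two "layers" whose combined gradient norm is 13.
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, total_norm = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for layer in clipped for g in layer))
```

Because every entry is multiplied by the same factor, the direction of the gradient is preserved and only its length changes.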

Diego · February 21, 2018, 8:35am

I understand now. So since the norm is the distance of a vector from the origin, you are forcing the gradient not to go too far from the position it was at the last time you ran model.zero_grad(). Did I get that right?

jpeg729 · February 21, 2018, 8:51am

I think it is more correct to say that you are forcing the gradients to be reasonably small, which means that the parameter updates will not push the parameters too far from their previous values.

Diego · February 21, 2018, 8:54am

I see. So it’s similar to decreasing the learning rate?

jpeg729 · February 21, 2018, 8:57am

But only selectively. Small gradients cause small updates and those will use the full learning rate.

Huge gradients will be squashed and the corresponding updates will be smaller.
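The selective behaviour can be checked directly. This is a small sketch, assuming a recent PyTorch release where the function is spelled `clip_grad_norm_` (with a trailing underscore, because it modifies gradients in place); the helper `clipped_grad` is a hypothetical name used only for this demonstration.

```python
import torch

def clipped_grad(grad_values, max_norm):
    """Hypothetical helper: attach a hand-written gradient to a dummy
    parameter, clip it, and return the clipped gradient and the original norm."""
    p = torch.nn.Parameter(torch.zeros(len(grad_values)))
    p.grad = torch.tensor(grad_values)
    # Returns the total norm measured before clipping.
    total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm)
    return p.grad.clone(), float(total_norm)

# Norm 0.5 is below the limit of 1.0, so this gradient is left as it is.
small, small_norm = clipped_grad([0.3, 0.4], max_norm=1.0)

# Norm 50 is far above the limit, so this gradient is rescaled to norm ~1.0.
large, large_norm = clipped_grad([30.0, 40.0], max_norm=1.0)
```

With plain SGD, the first update uses the full learning rate, while the second behaves as if the learning rate had been divided by roughly 50 for that step.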

Diego · February 21, 2018, 9:08am

So it’s similar to decreasing the learning rate only for big gradients, so that the updates always stay small.

jpeg729 · February 21, 2018, 9:15am

Yes. That is exactly what it does.

One correction: my previous answers assumed that lower layers would receive clipped gradients during backpropagation. This is not the case. clip_grad_norm is applied once, after the entire backward pass, to the gradients of all the parameters you give it.

Diego · February 21, 2018, 9:18am

So it just clips the resulting gradient right before performing one step. Correct?

jpeg729 · February 21, 2018, 9:18am

Assuming you apply it after calling loss.backward() and before calling optimizer.step(), then yes.
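For reference, the ordering described here could look like the following sketch. The model, data, learning rate and `max_norm = 1.0` are arbitrary assumptions chosen only to produce large gradients, and the code uses the current spelling `clip_grad_norm_`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy setup (all values are illustrative assumptions).
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
max_norm = 1.0

# Large inputs make the raw gradients large.
inputs = torch.randn(8, 4) * 100
targets = torch.randn(8, 1)

norms_after_clip = []
for _ in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                                   # 1. compute gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # 2. clip
    # Record the combined gradient norm that the optimizer will actually see.
    norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
    norms_after_clip.append(float(norm_after))
    optimizer.step()                                  # 3. update parameters
```

Clipping before `loss.backward()` would have no effect on the new gradients, and clipping after `optimizer.step()` would be too late to affect that update.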

Diego · February 21, 2018, 9:19am

Yes. That’s how I used it. Thanks a lot, you have been a great help!

Aiden1 · March 23, 2022, 4:14pm

Thanks for the good thread. I’m wondering what the intuition behind the max_norm (clipping) value is. I’ve seen people use 1 or 10, etc. How should you set the max value?

Thanks