Hello, I am trying to understand what this function does. I know it is used to prevent exploding gradients in a model, and I understand what the norm of a vector is, so I’m guessing that this function ‘clips’ the norm of a vector to a specific maximum value. But I would like to know how this prevents the exploding gradient problem and what exactly it does to the model parameters. Help would be greatly appreciated. Thanks!
In some cases you may find that each layer of your net amplifies the gradient it receives. This causes a problem because the lower layers of the net then get huge gradients and their updates will be far too large to allow the model to learn anything.
This function ‘clips’ the norm of the gradients: whenever that norm exceeds the chosen maximum, all of the gradients are scaled down by the same factor so that their overall norm is reduced to an acceptable level. In practice this places a limit on the size of the parameter updates.
The hope is that this will ensure that your model gets reasonably sized gradients and that the corresponding updates will allow the model to learn.
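To make the "same factor" point concrete, here is a minimal pure-Python sketch of the arithmetic. The helper name clip_by_global_norm and the eps term are my own choices for illustration, not PyTorch's code; it only assumes that the norm is the L2 norm taken over all gradients together.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by one shared factor so the joint L2 norm is at most max_norm.

    `grads` is a list of gradient vectors (lists of floats), one per parameter.
    Returns the clipped gradients and the norm measured before clipping.
    """
    # One norm over all gradient entries, as if they were a single long vector.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: gradients that are already small enough keep their values.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # joint norm: sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

Because every entry is multiplied by the same number, the direction of the overall gradient is unchanged; only its length shrinks.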
I understand now. So since the norm is the distance of a vector from the origin you are forcing the gradient to not go too far from the position it was the last time you ran model.zero_grad(). Did I get that right?
I think it is more correct to say that you are forcing the gradients to be reasonably small, which means that the parameter updates will not push the parameters too far from their previous values.
I see. So it’s similar to decreasing the learning rate?
But only selectively. Small gradients cause small updates and those will use the full learning rate.
Huge gradients will be squashed and the corresponding updates will be smaller.
So it’s similar to decreasing the learning rate, but only for big gradients, so that the updates always stay small.
Yes. That is exactly what it does.
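A quick way to see the selective behaviour is to clip one small and one large gradient with the same threshold. This is only a sketch: clipped_grad is a hypothetical helper written for this post, and it uses torch.nn.utils.clip_grad_norm_, which is the name of the function in current PyTorch releases (note the trailing underscore).

```python
import torch
from torch.nn.utils import clip_grad_norm_

def clipped_grad(grad_values, max_norm=1.0):
    """Return the gradient of a single dummy parameter after clipping (illustration only)."""
    p = torch.nn.Parameter(torch.zeros(len(grad_values)))
    p.grad = torch.tensor(grad_values)  # pretend backward() produced this gradient
    clip_grad_norm_([p], max_norm=max_norm)  # modifies p.grad in place
    return p.grad

small = clipped_grad([0.3, 0.4])    # norm 0.5, below the threshold: left as it is
large = clipped_grad([30.0, 40.0])  # norm 50, above the threshold: scaled down
print(small, small.norm())
print(large, large.norm())
```

The small gradient keeps its values, so its update uses the full learning rate, while the large one is rescaled to a norm of about max_norm.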
My previous answers assumed that lower layers would receive clipped gradients during the backpropagation. This is not the case. clip_grad_norm is applied after the entire backward pass.
So it just clips the resulting gradient right before performing one step. Correct?
Assuming you apply it after calling loss.backward() and before calling optimizer.step(), then yes.
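For reference, the ordering described here could look like the following in a training loop. Everything in it is an assumption made for illustration: the tiny linear model, the random data, the SGD settings and max_norm = 1.0 are not from this thread, and the call uses clip_grad_norm_, the current name of the function.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Deliberately large inputs so that the raw gradients are large too.
inputs = torch.randn(8, 4) * 100.0
targets = torch.randn(8, 1)

max_norm = 1.0
for step in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # 1. compute all gradients
    # 2. rescale the finished gradients in place; returns the norm before clipping
    norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()  # 3. update the parameters with the clipped gradients
    print(step, loss.item(), norm_before.item())

# The gradients stay attached to the parameters until the next zero_grad().
grad_norm_after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
```

Since the clipping happens once, after backward() has finished, it does not change what earlier layers receive during backpropagation; it only limits the size of the step that the optimizer takes.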
Yes. That’s how I used it. Thanks a lot you have been a great help!
Thanks for the good thread. I’m wondering what the intuition behind the max_clip value is. I’ve seen people use 1 or 10, etc.; how do you set your max value?
Thanks