Gradient clipping


#1

Hi everyone,

I am working on implementing Alex Graves’ model for handwriting synthesis (this is the link)

On page 23, he mentions the output derivatives and the LSTM derivatives

How can I do this part in PyTorch?

Thank you,
Omar


(Daniel Elliott) #2

Something like?

param.grad.data.clamp_(-1, 1)
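
For context, this clamp would sit between the backward pass and the optimizer step. A minimal sketch (model, optimizer, loss_fn, inputs, and targets are placeholders, not from this thread):

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
for param in model.parameters():
    if param.grad is not None:
        param.grad.data.clamp_(-1, 1)  # clip each parameter gradient element-wise, in place
optimizer.step()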

#3

Thank you for your reply @danelliottster.
But what are the output derivatives and the LSTM derivatives? How can I extract them?
param.grad is the gradient of the parameters of the network.
Are the LSTM derivatives == the gradients of the parameters of the LSTM?


(Daniel Elliott) #4

That is just what I was assuming. Sadly, I don’t know much about LSTM implementation in or out of PyTorch. I’m sure someone else can be more helpful…


#5

Thank you @danelliottster. I am almost sure now that the LSTM derivatives are the gradients of the parameters, but I still don’t know what the output derivatives are or how to extract them.


(Thomas V) #6

Hi,

note that by running the backward pass and then using param.grad.data.clamp, you are only clipping the final parameter gradients, not the intermediate gradients that are passed from outputs to inputs during the backpropagation’s chain-rule evaluation.
If you want the latter, you would want to create an autograd function that is the identity in forward and clips the gradient in backward. (I don’t know for sure whether pytorch has that.)

Best regards

Thomas
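
For illustration, a minimal sketch of the kind of function Thomas describes, written with torch.autograd.Function; the class name and clip bounds here are made up:

import torch

class ClipGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lo, hi):
        ctx.lo, ctx.hi = lo, hi
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # clip the gradient flowing back through this point; no gradient for lo/hi
        return grad_output.clamp(ctx.lo, ctx.hi), None, None

# usage: y = ClipGrad.apply(x, -10.0, 10.0)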


(James Bradbury) #7

I’m not sure what the author is referring to by LSTM gradients, but you definitely want to use a backward hook that clips the gradient rather than create a new autograd function.


#8

thank you @tom and @jekbradbury
@jekbradbury: it works fine for me for clipping the LSTM gradients, but the output gradients are still not clear to me at all


(Sean Robertson) #9

From eqs. 1, 2, and 3 it seems he is making a distinction between the LSTM hidden layers (h_t) and the final output layer before the softmax (yhat).


(Thomas V) #10

Indeed. The original code seems to be available, too, and the function used for clipping is bound_range:

$ git clone https://github.com/szcom/rnnlib
$ rgrep bound_range  rnnlib/
rnnlib/src/MixtureOutputLayer.hpp:    bound_range(paramSigmaXY, almostZero, realMax);
rnnlib/src/MixtureOutputLayer.hpp:        bound_range(inputErrors[pt], -100.0, 100.0);
rnnlib/src/Helpers.hpp:static void bound_range(R &r,
rnnlib/src/Lstm1dLayer.hpp:      bound_range(inErrs, -10.0, 10.0);
rnnlib/src/LstmLayer.hpp:      bound_range(inErrs, -10.0, 10.0);
rnnlib/src/CharWindowLayer.hpp:      bound_range(inputErrors[coords], -10.0, 10.0);

I do look forward to looking at PyTorch code instead. :slight_smile:

Best regards

Thomas
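
In PyTorch terms, a rough analogue of those bound_range calls would be backward hooks with the same ranges. This is only a sketch; lstm_in and mix_in are hypothetical tensors (the inputs to an LSTM layer and to the mixture output layer) that require grad:

lstm_in.register_hook(lambda g: g.clamp(-10.0, 10.0))    # cf. LstmLayer.hpp: bound_range(inErrs, -10, 10)
mix_in.register_hook(lambda g: g.clamp(-100.0, 100.0))   # cf. MixtureOutputLayer.hpp: bound_range(inputErrors, -100, 100)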


(Ke Ding) #11

As @jekbradbury suggested, gradient clipping can be defined in a Theano-like way:

def clip_grad(v, min, max):
    v.register_hook(lambda g: g.clamp(min, max))
    return v

A demo LSTM implementation with gradient clipping can be found here.
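
For example, the helper can be applied to any intermediate tensor that requires grad; a hypothetical use on a hidden state feeding an output layer would be:

h_t = clip_grad(h_t, -10.0, 10.0)   # clips the gradient arriving at h_t during backprop
y_hat = output_layer(h_t)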


(MirandaAgent) #12

It’s really annoying because clamp_ is not documented. I guess one just has to randomly try functions with an underscore at the end and see if they exist. Thanks though.


similar Q:


(Ulf Aslak) #13

Would you mind sharing your implementation?


(Raul Puri) #14

I just always do this
def clip_gradient(model, clip):
    """Clip the gradient."""
    if clip is None:
        return
    for p in model.parameters():
        if p.grad is None:
            continue
        p.grad.data = p.grad.data.clamp(-clip, clip)

and follow it up with a normal step() call to my optimizer.


(Hugh Perkins) #16

@raulpuric In fact, your code does the same thing as torch.nn.utils.clip_grad_norm_ , as far as I can see? ie, you’re going to call this between loss.backward() and opt.step(), and clip the gradients after the full backprop has taken place?

@tom what things do you recommend applying the hook to? Are you just applying it to the state from each time step? or … ? What happens if one is using an nn.LSTM, rather than an nn.LSTMCell? Will the hook get called each timestep? Or is it obligatory to use nn.LSTMCell if we want to clip the gradients at each timestep?
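
For reference, a sketch of the built-in utilities (model, loss, and optimizer are placeholders; 0.25 is an arbitrary threshold). Note that clip_grad_norm_ rescales the gradients by their total norm, whereas an element-wise clamp like the one above corresponds more closely to torch.nn.utils.clip_grad_value_:

loss.backward()
# rescale all gradients so that their combined norm is at most 0.25
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
# or, element-wise: torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.25)
optimizer.step()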


(Thomas V) #17

You can’t hook anything in the timesteps if you don’t see them, so currently you need to do the looping yourself, which makes LSTMCell a natural choice.
I hope that PR 14957 brings manual LSTM performance back into the “awesome” region (I haven’t timed it yet); then we’d have to see how to get clipping in there.

Best regards

Thomas
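
To tie the thread together, here is a rough sketch of the manual LSTMCell loop Thomas describes, with per-timestep hooks using the clip ranges quoted from rnnlib above. All sizes, the linear output head, and the dummy loss are made up for illustration; this is not the paper’s code:

import torch
import torch.nn as nn

input_size, hidden_size, seq_len, batch = 3, 400, 50, 32   # arbitrary sizes
cell = nn.LSTMCell(input_size, hidden_size)
out_layer = nn.Linear(hidden_size, 121)                    # placeholder output head

x = torch.randn(seq_len, batch, input_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)

outputs = []
for t in range(seq_len):
    h, c = cell(x[t], (h, c))
    # clip the gradients flowing back through this timestep's states (LSTM derivatives)
    h.register_hook(lambda g: g.clamp(-10.0, 10.0))
    c.register_hook(lambda g: g.clamp(-10.0, 10.0))
    y_hat = out_layer(h)
    # clip the gradients of the network outputs (output derivatives)
    y_hat.register_hook(lambda g: g.clamp(-100.0, 100.0))
    outputs.append(y_hat)

loss = torch.stack(outputs).pow(2).mean()   # dummy loss, just to have something to backprop
loss.backward()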