Gradient clipping

Hi everyone,

I am working on implementing Alex Graves' model for handwriting synthesis (this is the link)

On page 23, he mentions the output derivatives and the LSTM derivatives

How can I do this part in PyTorch?

Thank you,


Something like this?

Thank you for your reply @danelliottster
But what are the output derivatives and the LSTM derivatives? How can I extract them?
param.grad is the gradient for the parameters of the network.
Are the LSTM derivatives the same as the gradients of the LSTM's parameters?

That is just what I was assuming. Sadly, I don’t know much about LSTM implementation in or out of PyTorch. I’m sure someone else can be more helpful…

Thank you @danelliottster. I am almost sure now that the LSTM derivatives are the gradients of the parameters, but I still don't know what the output derivatives are or how to extract them.


Note that by doing the backward pass and then clipping, you are only clipping the final gradients, not the gradients of outputs fed into inputs during the backpropagation's chain-rule evaluation.
If you want the latter, you would want to create an autograd function that is the identity in the forward pass and clips the gradient in the backward pass. (I don't know for sure whether PyTorch has that built in.)
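A minimal sketch of such a function (the class name and bounds are mine, not from the thread):

    import torch

    class ClipGrad(torch.autograd.Function):
        """Identity in the forward pass; clamps the gradient in the backward pass."""

        @staticmethod
        def forward(ctx, x, lo, hi):
            ctx.lo, ctx.hi = lo, hi
            return x.view_as(x)  # identity

        @staticmethod
        def backward(ctx, grad_output):
            # clip the gradient flowing back through x; lo/hi need no gradient
            return grad_output.clamp(ctx.lo, ctx.hi), None, None

    x = torch.ones(3, requires_grad=True)
    y = ClipGrad.apply(x, -0.5, 0.5) * 100.0
    y.sum().backward()
    # each entry of x.grad is 100, clamped to 0.5 on the way back
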

Best regards



I’m not sure what the author is referring to by LSTM gradients, but you definitely want to use a backward hook that clips the gradient rather than create a new autograd function.


thank you @tom and @jekbradbury
@jekbradbury: it works fine for me when clipping the LSTM gradients, but the output gradients are still not clear to me at all

From eq. 1, 2, and 3 it seems he is making a distinction between the LSTM hidden layers (h_t) and the final output layer before the softmax (yhat).


Indeed. The original code seems to be available, too, and the function used for clipping is bound_range:

$ git clone
$ rgrep bound_range  rnnlib/
rnnlib/src/MixtureOutputLayer.hpp:    bound_range(paramSigmaXY, almostZero, realMax);
rnnlib/src/MixtureOutputLayer.hpp:        bound_range(inputErrors[pt], -100.0, 100.0);
rnnlib/src/Helpers.hpp:static void bound_range(R &r,
rnnlib/src/Lstm1dLayer.hpp:      bound_range(inErrs, -10.0, 10.0);
rnnlib/src/LstmLayer.hpp:      bound_range(inErrs, -10.0, 10.0);
rnnlib/src/CharWindowLayer.hpp:      bound_range(inputErrors[coords], -10.0, 10.0);
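In PyTorch, those per-tensor bounds can be sketched with backward hooks. Only the clipping ranges below come from the rnnlib listing above; the toy one-step "LSTM" and output layer are illustrative stand-ins, not Graves' actual model:

    import torch

    torch.manual_seed(0)

    def bound_range(t, lo, hi):
        # hook-based analogue of rnnlib's bound_range: clip the error
        # signal flowing back through t to [lo, hi]
        t.register_hook(lambda g: g.clamp(lo, hi))
        return t

    # toy stand-ins for one LSTM step and the pre-softmax output layer
    x = torch.randn(4, 8)
    w_h = torch.randn(8, 8, requires_grad=True)
    w_y = torch.randn(8, 3, requires_grad=True)

    h = bound_range(torch.tanh(x @ w_h), -10.0, 10.0)   # "LSTM derivatives"
    y_hat = bound_range(h @ w_y, -100.0, 100.0)         # "output derivatives"

    loss = (1000.0 * y_hat).sum()  # deliberately large so clipping engages
    loss.backward()
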

I do look forward to seeing PyTorch code instead. :slight_smile:

Best regards



As @jekbradbury suggested, gradient clipping can be defined in a Theano-like way:

    def clip_grad(v, lo, hi):
        # clip the gradient flowing back through v during backprop
        v.register_hook(lambda g: g.clamp(lo, hi))
        return v

A demo LSTM implementation with gradient clipping can be found here.


It's really annoying because clamp_ is not documented. I guess one just has to try functions with a trailing underscore and see whether they exist. Thanks though.

similar Q:

Would you mind sharing your implementation?

I just always do this:

    def clip_gradient(model, clip):
        """Clip the gradient by rescaling when its total norm exceeds clip."""
        if clip is None:
            return
        totalnorm = 0
        for p in model.parameters():
            if p.grad is None:
                continue
            totalnorm += p.grad.data.norm() ** 2
        totalnorm = totalnorm ** 0.5
        # rescale all gradients if the total norm exceeds the threshold
        norm = max(totalnorm, clip)
        for p in model.parameters():
            if p.grad is None:
                continue
            p.grad.data.mul_(clip / norm)

and follow it up with a normal step call to my optimizer (my code wasn't formatting properly, but I think you get the point)


Underscore suffix or not, it doesn't matter here.

@raulpuric In fact, your code does the same thing as torch.nn.utils.clip_grad_norm_, as far as I can see? i.e., you call this between loss.backward() and opt.step(), and clip the gradients after the full backprop has taken place?
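For reference, the built-in version (its current name carries the trailing underscore) slots in between backward() and step(); the sizes and the inflated loss below are just to make the clipping observable:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.LSTM(input_size=3, hidden_size=8)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(5, 1, 3)       # (seq_len, batch, input_size)
    out, _ = model(x)
    loss = out.pow(2).sum() * 1e6  # huge loss so clipping actually engages

    opt.zero_grad()
    loss.backward()
    # rescale all parameter gradients so their global L2 norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
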

@tom what things do you recommend applying the hook to? Are you just applying it to the state from each time step? or … ? What happens if one is using an nn.LSTM, rather than an nn.LSTMCell? Will the hook get called each timestep? Or is it obligatory to use nn.LSTMCell if we want to clip the gradients at each timestep?

You can't hook anything in the timesteps if you don't see them, so currently you need to do the looping yourself, making LSTMCell a natural choice.
I hope that PR 14957 brings manual LSTM performance back into the "awesome" region (though I haven't timed it yet); then we'd have to see how to get clipping in there.
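Until then, a per-timestep sketch with nn.LSTMCell and the hook-based clip_grad from earlier in the thread might look like this (the sizes are arbitrary):

    import torch
    import torch.nn as nn

    def clip_grad(v, lo, hi):
        # clip the gradient flowing back through v during backprop
        v.register_hook(lambda g: g.clamp(lo, hi))
        return v

    cell = nn.LSTMCell(input_size=3, hidden_size=8)
    h = torch.zeros(1, 8)
    c = torch.zeros(1, 8)
    xs = torch.randn(5, 1, 3)  # (seq_len, batch, input_size)

    outs = []
    for t in range(xs.size(0)):
        h, c = cell(xs[t], (h, c))
        # clip the state gradients at every timestep, as rnnlib does
        h = clip_grad(h, -10.0, 10.0)
        c = clip_grad(c, -10.0, 10.0)
        outs.append(h)

    torch.stack(outs).sum().backward()
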

Best regards