I have tested nn.LSTM against a simple LSTM implementation and found almost no difference in performance. Maybe I overestimated the overhead of the extra additions based on a rough guess. Thank you!
If you’re running on a GPU you’ll also likely see great speedups from using the cuDNN LSTM implementation.
I tested on CPU and got results within just a few milliseconds of each other. (For anyone who may try to implement an LSTM for benchmarking:) I think a few extra additions are insignificant compared to the more expensive computations, like the weight-matrix multiplications, the nonlinear activation functions, or even the Python loop itself.
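For anyone who wants to reproduce this kind of comparison, here is a minimal CPU benchmark sketch, using nn.LSTMCell in a Python loop as the "simple" implementation (all sizes here are made up):

```python
import time
import torch
import torch.nn as nn

# Hypothetical sizes, just for illustration
seq_len, batch, input_size, hidden_size = 100, 32, 128, 128

x = torch.randn(seq_len, batch, input_size)

lstm = nn.LSTM(input_size, hidden_size)      # fused implementation
cell = nn.LSTMCell(input_size, hidden_size)  # one step at a time

def bench(fn, n=10):
    start = time.time()
    for _ in range(n):
        fn()
    return (time.time() - start) / n

def run_fused():
    lstm(x)

def run_loop():
    h = torch.zeros(batch, hidden_size)
    c = torch.zeros(batch, hidden_size)
    for t in range(seq_len):
        h, c = cell(x[t], (h, c))

print('nn.LSTM      :', bench(run_fused))
print('LSTMCell loop:', bench(run_loop))
```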
Quick question about this @apaszke: is the Variable.grad.data that we should pass to our clip function part of the model object, or (if we use a different optimizer) of the optimizer object? That is, does the optimizer itself call backward? In which case, the code below should pass the optimizer to the clip function, right?
```python
optimizer.zero_grad()
output, hidden = model(data, hidden)
loss = criterion(output.view(-1, ntokens), targets)
loss.backward()
clipped_lr = lr * clip_gradient(model, clip)
for p in model.parameters():
    p.data.add_(-clipped_lr, p.grad.data)
optimizer.step()
```
I’m sorry, but I don’t understand the question. The optimizer never calls backward() itself, unless you give it a callable argument (see the torch.optim docs for more details on that). BTW, you might want to use nn.utils.clip_grad_norm for the clipping.
Maybe I’m doing something wrong here, but using gradient clipping like
```python
nn.utils.clip_grad_norm(model.parameters(), clip)
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)
```
makes my network train much more slowly than without it. Here’s what it looks like with gradient clipping:
```
Epoch: 1/10... Step: 10... Loss: 4.4288
Epoch: 1/10... Step: 20... Loss: 4.4274
Epoch: 1/10... Step: 30... Loss: 4.4259
Epoch: 1/10... Step: 40... Loss: 4.4250
Epoch: 1/10... Step: 50... Loss: 4.4237
Epoch: 1/10... Step: 60... Loss: 4.4223
Epoch: 1/10... Step: 70... Loss: 4.4209
Epoch: 1/10... Step: 80... Loss: 4.4193
Epoch: 1/10... Step: 90... Loss: 4.4188
Epoch: 1/10... Step: 100... Loss: 4.4174
```
And without gradient clipping, everything else equal:
```
Epoch: 1/10... Step: 10... Loss: 3.2837
Epoch: 1/10... Step: 20... Loss: 3.1901
Epoch: 1/10... Step: 30... Loss: 3.1512
Epoch: 1/10... Step: 40... Loss: 3.1296
Epoch: 1/10... Step: 50... Loss: 3.1170
Epoch: 1/10... Step: 60... Loss: 3.0758
Epoch: 1/10... Step: 70... Loss: 2.9787
Epoch: 1/10... Step: 80... Loss: 2.9104
Epoch: 1/10... Step: 90... Loss: 2.8271
Epoch: 1/10... Step: 100... Loss: 2.6813
```
There is probably something I don’t understand, but I’m just switching out those two bits of code.
Maybe you’re clipping them to very small values. That’s a possible effect of a clip threshold that is too low.
The one that comes with nn.utils clips in proportion to the magnitude of the gradients, so you want to make sure the threshold is not too small for your particular model, as Adam said (I think :p). The old-fashioned way of clipping/clamping is:
```python
def gradClamp(parameters, clip=5):
    for p in parameters:
        # clamp both tails, element-wise
        p.grad.data.clamp_(min=-clip, max=clip)
```
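Called after the backward pass, e.g. (assuming the usual training loop):

```python
loss.backward()
gradClamp(model.parameters(), clip=5)
optimizer.step()
```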
For people trying to just get an answer quickly, the norm-based version is the one-liner below:
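```python
# clips the total gradient norm in place; named clip_grad_norm
# (no trailing underscore) in PyTorch versions before 0.4
torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
```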
or with in-place clamp:
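```python
# element-wise alternative: clamp each gradient entry to [-clip, clip]
for p in model.parameters():
    p.grad.data.clamp_(-clip, clip)
```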
also similar Q:
I thought nn.utils.clip_grad_norm(model.parameters(), clip) was supposed to finish the job. Why do we still need the for p in model.parameters(): p.data.add_(-lr, p.grad.data) update loop?
Can someone give a more explicit explanation? Is it that after I use gradient clipping, I may not use the Adam optimizer?
You use torch.nn.utils.clip_grad_norm to keep the gradients within a specific range (clip). In RNNs the gradients tend to grow very large (this is called ‘the exploding gradient problem’), and clipping them helps to prevent this from happening. It is probably helpful to look at the implementation, because it teaches us that:
- “The norm is computed over all gradients together, as if they were concatenated into a single vector.”
- You can control the norm type (lp-norm, with p defaulting to 2; or the L-inf norm).
- All of the gradient coefficients are multiplied by the same scale factor (max_norm / total_norm), and only when the total norm exceeds max_norm, so the direction of the overall gradient is preserved.
- clip_grad_norm is invoked after all of the gradients have been computed, i.e. between loss.backward() and optimizer.step(). So during loss.backward(), the gradients that are propagated backwards are not clipped; only once the backward pass completes is clip_grad_norm() invoked. optimizer.step() then uses the updated gradients (see the sketch below).
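As a minimal sketch of that ordering (model, criterion, inputs, targets, and max_norm are placeholders):

```python
optimizer.zero_grad()
output = model(inputs)
loss = criterion(output, targets)
loss.backward()                       # gradients are computed, still unclipped
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # rescaled in place
optimizer.step()                      # the update sees the clipped gradients
```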
Regarding the code you asked about:
```python
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)
```
This iterates across all of the model.parameters() and performs an in-place multiply-add on each of the parameter tensors p.
p.data.add_ is functionally equivalent to:

```python
p.data = p.data + (-lr * p.grad.data)
```
In other words, this performs a similar function to optimizer.step(), using the gradients to update the model parameters, but without the extra sophistication of a torch.optim.Optimizer. If you use the above code, then you should not use an optimizer (and vice versa).
clip_grad_norm_ modifies the gradient after the entire backpropagation has taken place. In the RNN context it is common to restrict the gradient that is being backpropagated during the calculation. This is described e.g. in Alex Graves’ famous RNN paper.
To do the latter, you typically use register_hook on the inputs or outputs of certain operations, e.g. with lambda x: x.clamp(-10, 10) to do element-wise clipping.
For a practical example, you could search for
register_hook in my Graves handwriting generation notebook.
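For illustration, here is a minimal sketch of that hook pattern (the cell and sizes are made up; Tensor.register_hook receives the gradient of the loss w.r.t. that tensor and may return a modified version):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(10, 20)                     # hypothetical RNN cell
x = torch.randn(5, 10)
h, c = torch.zeros(5, 20), torch.zeros(5, 20)

h, c = cell(x, (h, c))
# Element-wise clipping of the gradient flowing back through h,
# applied during the backward pass itself (unlike clip_grad_norm_,
# which runs only after backpropagation has finished):
h.register_hook(lambda grad: grad.clamp(-10, 10))

h.sum().backward()
```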
@tom Awesome info. Thanks!
Thanks. Where does this go in relation to forward and backward propagation?
What’s the reason for not using optimizer.step() after clipping the gradients?
No reason: you can certainly use
optimizer.step() and it will most likely lead to a better solution since the optimizer will update the parameters in a more sophisticated way (e.g. using momentum).
My bad, I thought you were suggesting that if you do gradient clipping, then you should (for some reason) use custom updates instead of optimizer.step(). Now I get it: you meant that if you use custom updates, then you should not use optimizer.step() (to avoid mixing custom and automatic updates). Makes sense!
You need to use both optimizer.step and clipping, right? Because optimizer.step calculates the gradient, and then you want to clip those gradients to prevent them exploding on the next training step?
loss.backward() calculates the gradients, clip_grad_norm_ limits their norm, and optimizer.step() updates the parameters. But yes, you need the first and the last.
Does Variable.grad.data give access to gradients that are normalized per batch? If yes, how can I get access to the unnormalized gradients?