I have tested nn.LSTM against a simple LSTM implementation and found almost no difference in performance. Maybe I overestimated the overhead of the extra additions when I made my rough guess. Thank you!
If you’re running on a GPU, you’ll also likely see large speedups from using the cuDNN LSTM implementation.
I tested on CPU and the difference was just a few milliseconds. (For anyone who may try to implement an LSTM for benchmarking:) I think a few extra additions are insignificant compared to the other, more expensive computations, such as the weight-matrix multiplications, the nonlinear activation functions, or even the Python loop itself.
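For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a CPU timing run; the sizes are arbitrary placeholders, and the hand-rolled loop uses nn.LSTMCell as a stand-in for a "simple" implementation:

import time
import torch
import torch.nn as nn

# Arbitrary placeholder sizes for a rough CPU comparison.
input_size, hidden_size, seq_len, batch = 128, 256, 100, 32
x = torch.randn(seq_len, batch, input_size)

lstm = nn.LSTM(input_size, hidden_size)
cell = nn.LSTMCell(input_size, hidden_size)

start = time.time()
out, _ = lstm(x)                              # fused implementation
print('nn.LSTM:       %.1f ms' % ((time.time() - start) * 1000))

start = time.time()
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
for t in range(seq_len):                      # explicit Python loop over time
    h, c = cell(x[t], (h, c))
print('LSTMCell loop: %.1f ms' % ((time.time() - start) * 1000))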
Quick question about this @apaszke: are the Variable.grad.data that we pass to our clip function part of the model object, or (if we use a different optimizer) part of the optimizer object? That is, does the optimizer itself call backward()? In that case, the code below should pass the optimizer to the clip function, right?
optimizer.zero_grad()
output, hidden = model(data, hidden)
loss = criterion(output.view(-1, ntokens), targets)
loss.backward()
clipped_lr = lr * clip_gradient(model, clip)  # scale the learning rate by the clip factor
for p in model.parameters():
    p.data.add_(-clipped_lr, p.grad.data)     # manual parameter update
optimizer.step()
I’m sorry, but I don’t understand the question. The optimizer never calls backward() itself, unless you give it a callable argument (see the torch.optim docs for more details on that). BTW, you might want to use torch.nn.utils.clip_grad_norm now.
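For reference, here is a minimal sketch of how that would slot into the training loop above, using the same names as in your snippet (clip_grad_norm_ is the in-place spelling in newer PyTorch versions):

optimizer.zero_grad()
output, hidden = model(data, hidden)
loss = criterion(output.view(-1, ntokens), targets)
loss.backward()
# Rescale all gradients in place so their combined norm is at most `clip`,
# then let the optimizer perform the actual parameter update.
torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
optimizer.step()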
Maybe I’m doing something wrong here, but using gradient clipping like

nn.utils.clip_grad_norm(model.parameters(), clip)
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

makes my network train much slower than with optimizer.step().
Here’s what it looks like with gradient clipping, with clip=5:
Epoch: 1/10... Step: 10... Loss: 4.4288
Epoch: 1/10... Step: 20... Loss: 4.4274
Epoch: 1/10... Step: 30... Loss: 4.4259
Epoch: 1/10... Step: 40... Loss: 4.4250
Epoch: 1/10... Step: 50... Loss: 4.4237
Epoch: 1/10... Step: 60... Loss: 4.4223
Epoch: 1/10... Step: 70... Loss: 4.4209
Epoch: 1/10... Step: 80... Loss: 4.4193
Epoch: 1/10... Step: 90... Loss: 4.4188
Epoch: 1/10... Step: 100... Loss: 4.4174
And without gradient clipping, everything else equal:
Epoch: 1/10... Step: 10... Loss: 3.2837
Epoch: 1/10... Step: 20... Loss: 3.1901
Epoch: 1/10... Step: 30... Loss: 3.1512
Epoch: 1/10... Step: 40... Loss: 3.1296
Epoch: 1/10... Step: 50... Loss: 3.1170
Epoch: 1/10... Step: 60... Loss: 3.0758
Epoch: 1/10... Step: 70... Loss: 2.9787
Epoch: 1/10... Step: 80... Loss: 2.9104
Epoch: 1/10... Step: 90... Loss: 2.8271
Epoch: 1/10... Step: 100... Loss: 2.6813
There is probably something I don’t understand, but I’m just switching out those two bits of code.
Maybe you’re clipping them to very small values. That’s a possible effect.
The one that comes with nn.utils clips in proportion to the magnitude of the gradients. So you want to make sure the clip value is not too small for your particular model, as Adam said (I think :p). The old-fashioned way of clipping/clamping is:
def gradClamp(parameters, clip=5):
    # Element-wise clamp of each gradient into [-clip, clip].
    for p in parameters:
        p.grad.data.clamp_(min=-clip, max=clip)
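As a quick (hypothetical) diagnostic for choosing the clip value, you can print the total gradient norm after the backward pass and check how it compares to clip; if the norm is routinely much larger, the clipped updates will be much smaller than the unclipped ones:

import torch
import torch.nn as nn

model = nn.LSTM(10, 20)                       # stand-in for your model
out, _ = model(torch.randn(5, 3, 10))
out.sum().backward()

# Total L2 norm over all gradients, as clip_grad_norm computes it.
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.data.norm(2).item() ** 2
print('total gradient norm: %.4f' % (total_norm ** 0.5))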
For people trying to just get an answer quickly:

torch.nn.utils.clip_grad_norm(mdl_sgd.parameters(), clip)

or with an in-place clamp:

W.grad.data.clamp_(-clip, clip)
I thought nn.utils.clip_grad_norm(model.parameters(), clip) was supposed to finish the job. What is

for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

for? Can someone give a more explicit explanation? Is it that after I use gradient clipping, I may not use the Adam optimizer?
@ntubertchen
Hi,
Use torch.nn.utils.clip_grad_norm to keep the norm of the gradients within a specific bound (clip). In RNNs the gradients tend to grow very large (this is called ‘the exploding gradient problem’), and clipping them helps to prevent this from happening. It is probably helpful to look at the implementation, because it teaches us that:
- “The norm is computed over all gradients together, as if they were concatenated into a single vector.”
- You can control the norm type (lp-norm, with p defaulting to 2; or the L-inf norm).
- All of the gradient coefficients are multiplied by the same clip_coef (see the sketch below).
- clip_grad_norm is invoked after all of the gradients have been computed, i.e. between loss.backward() and optimizer.step(). So during loss.backward() the gradients that are propagated backwards are not clipped; they are only clipped once the backward pass completes and clip_grad_norm() is invoked. optimizer.step() will then use the updated (clipped) gradients.
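Roughly speaking, the default p=2 case boils down to something like this simplified sketch (just the idea, not the actual source):

total_norm = 0.0
for p in parameters:
    total_norm += p.grad.data.norm(2) ** 2    # accumulate squared norms
total_norm = total_norm ** 0.5                # one norm over all gradients at once
clip_coef = max_norm / (total_norm + 1e-6)
if clip_coef < 1:
    for p in parameters:
        p.grad.data.mul_(clip_coef)           # same scale factor for every gradient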
Regarding the code you ask about:
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)
This iterates across all of the model.parameters() and performs an in-place multiply-add on each parameter tensor (p).
p.data.add_ is functionally equivalent to:
p.data = p.data + (-lr * p.grad.data)
In other words, this performs a similar function to optimizer.step(), using the gradients to update the model parameters, but without the extra sophistication of a torch.optim.Optimizer. If you use the above code, then you should not use an optimizer (and vice versa).
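As a concrete (hypothetical) illustration of that equivalence, with a throwaway model:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                       # throwaway model
lr = 0.1

loss = model(torch.randn(8, 4)).sum()
loss.backward()
# Manual update, as in the snippet above (newer PyTorch spells this
# p.data.add_(p.grad.data, alpha=-lr)):
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

# The equivalent update via torch.optim: plain SGD without momentum,
# dampening, or weight decay.
optimizer = optim.SGD(model.parameters(), lr=lr)
optimizer.zero_grad()
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()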
Cheers,
Neta
Note that clip_grad_norm_ modifies the gradient after the entire backpropagation has taken place. In the RNN context it is common to restrict the gradient that is being backpropagated during the calculation. This is described e.g. in Alex Graves’ famous RNN paper.
To do the latter, you typically use register_hook on the inputs or outputs of certain operations, e.g. with lambda x: x.clamp(-10, 10), to do element-wise clipping.
For a practical example, you could search for register_hook in my Graves handwriting generation notebook.
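Here is a minimal, self-contained sketch of that pattern (my own toy example, not taken from the notebook):

import torch
import torch.nn as nn

rnn = nn.LSTM(10, 20)
x = torch.randn(5, 3, 10)
out, _ = rnn(x)

# Clip the gradient element-wise *as it flows back* through `out`,
# instead of rescaling all gradients after backprop has finished.
out.register_hook(lambda grad: grad.clamp(-10, 10))

out.sum().backward()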
Best regards
Thomas
Thanks. Where does this go in relation to forward and backward propagation?
What’s the reason for not using optimizer.step after clipping the gradients?
No reason: you can certainly use optimizer.step(), and it will most likely lead to a better solution, since the optimizer will update the parameters in a more sophisticated way (e.g. using momentum).
My bad, I thought you were suggesting that if you do gradient clipping, then you should (for some reason) use custom updates instead of optimizer.step(). Now I get it: you meant that if you use custom updates, then you should not use optimizer.step() (to avoid mixing custom and automatic updates). Makes sense!
You need to use both optimizer.step and clip, right? Because optimizer.step calculates the gradient, and then you want to clip those gradients to prevent vanishing on the next training step?
No, loss.backward() calculates the gradient, clip_grad_norm_ limits its norm, and optimizer.step() updates the parameters. But yes, you need the first and last.
Best regards
Thomas
Does Variable.grad.data give access to gradients that are normalized per batch? If yes, how can I get access to the unnormalized gradients?