Is there a proper way to do gradient clipping, for example, with Adam? It seems that the value of Variable.grad.data should be manipulated (clipped) before calling the optimizer.step() method. I think the value of Variable.grad.data can be modified in-place to do gradient clipping. Is that safe to do?
Also, is there a reason that Autograd RNN cells have separate biases for input-to-hidden and hidden-to-hidden? I think this is redundant and adds some overhead.
You can safely modify Variable.grad.data in-place after the backward pass finishes. For example see how it’s done in the language modelling example.
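A minimal sketch of that pattern with Adam follows. The toy model, the random data, and the max_norm threshold are assumptions made up for illustration, and the clipping call is torch.nn.utils.clip_grad_norm_ from recent PyTorch releases rather than the helper used in the language modelling example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, only for illustration.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
inputs, targets = torch.randn(8, 4), torch.randn(8, 2)
max_norm = 0.1  # assumed clipping threshold

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()  # gradients are now stored in p.grad

# Rescale all gradients in place so their combined L2 norm is at most max_norm.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Recompute the combined norm to check the effect of clipping.
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

optimizer.step()  # Adam consumes the clipped gradients
```

The only ordering that matters here is backward, then clip, then step.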
The reason for the separate biases is that it gives a nice user-facing API where you have both weight tensors exposed. Also, it opens up the possibility of doing a batched matrix multiply on the inputs for all steps, and then only applying the hidden-to-hidden weights (it’s not yet added there). If you measure the overhead and show us that it can be implemented in a clean and fast way, we’ll happily accept a PR or change it.
I have tested nn.LSTM against a simple LSTM implementation and found almost no difference in performance. Maybe I overestimated the overhead of the extra addition based on a simple guess. Thank you!
I tested on CPU and the difference was no more than a few milliseconds. (For anyone who may try to implement an LSTM for benchmarking:) I think a few extra additions are insignificant compared with other expensive computations, such as multiplication of weight matrices, nonlinear activation functions, or even the Python loop itself.
Quick question about this @apaszke: is the Variable.grad.data that we should pass to our clip function part of the model object or (if we use a different optimizer) the optimizer object?
That is, does the optimizer itself call backward? In which case, the code below should pass the optimizer to the clip function, right?
optimizer.zero_grad()
output, hidden = model(data, hidden)
loss = criterion(output.view(-1, ntokens), targets)
loss.backward()
clipped_lr = lr * clip_gradient(model, clip)
for p in model.parameters():
    p.data.add_(-clipped_lr, p.grad.data)
optimizer.step()
I’m sorry but I don’t understand the question. Optimizer never calls backward() itself, unless you give it a callable argument (see torch.optim docs for more details on that). BTW you might want to use torch.nn.utils.clip_grad_norm now.
The one that comes with nn.utils clips in proportion to the magnitude of the gradients. Thus you’d like to make sure the clip value is not too small for your particular model, as Adam said (I think :p). The old-fashioned way of clipping/clamping is
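def gradClamp(parameters, clip=5):
    # Clamp every gradient element to the range [-clip, clip], in place.
    for p in parameters:
        if p.grad is not None:
            p.grad.data.clamp_(min=-clip, max=clip)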
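For comparison, recent PyTorch releases also provide torch.nn.utils.clip_grad_value_, which does this kind of element-wise clamping. A small sketch, where the parameter and its hand-assigned gradient are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical parameter with a hand-assigned gradient, for illustration only.
p = nn.Parameter(torch.zeros(3))
p.grad = torch.tensor([-8.0, 0.5, 12.0])

# Clamp each gradient element to [-5, 5] in place.
torch.nn.utils.clip_grad_value_([p], clip_value=5.0)
# p.grad is now tensor([-5.0, 0.5, 5.0])
```

Unlike norm-based clipping, this changes the direction of the gradient vector, because only the large elements are cut.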
@ntubertchen
Hi,
Use torch.nn.utils.clip_grad_norm to keep the gradient norm below a specific threshold (clip). In RNNs the gradients tend to grow very large (this is called ‘the exploding gradient problem’), and clipping them helps to prevent this from happening. It is probably helpful to look at the implementation because it teaches us that:
“The norm is computed over all gradients together, as if they were concatenated into a single vector.”
You can control the norm type (lp-norm, with p defaulting to 2; or the L-inf norm).
All of the gradient coefficients are multiplied by the same clip_coef.
clip_grad_norm is invoked after all of the gradients have been updated. I.e. between loss.backward() and optimizer.step(). So during loss.backward(), the gradients that are propagated backwards are not clipped, until the backward pass completes and clip_grad_norm() is invoked. optimizer.step() will then use the updated gradients.
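The points above can be sketched in plain Python, without PyTorch. This is an illustrative reimplementation of the L2 case only, not the library code; the function name clip_by_global_norm and the eps guard against division by zero are assumptions of this sketch.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient lists so their joint L2 norm is at most max_norm."""
    # The norm is taken over all gradients together, as one long vector.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        # Every coefficient is multiplied by the same clip_coef.
        grads = [[g * clip_coef for g in grad] for grad in grads]
    return grads, total_norm

# Two "parameters" whose gradients have a joint norm of 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, total_norm = clip_by_global_norm(grads, max_norm=1.0)
```

Because a single coefficient scales everything, the ratio between gradient elements (here 3:4) is preserved; only the length of the overall vector shrinks.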
Regarding the code you ask about:
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)
This iterates across all of the model.parameters() and performs an in-place multiply-add on each of the parameter tensors (p).
p.data.add_ is functionally equal to:
p.data = p.data + (-lr * p.grad.data)
In other words, this performs a similar function to optimizer.step(), using the gradients to update the model parameters, but without the extra sophistication of a torch.optim.Optimizer. If you use the above code, then you should not use an optimizer (and vice-versa).
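To make the equivalence concrete, here is a small sketch comparing the manual update with plain torch.optim.SGD (no momentum). The toy model, data, and learning rate are assumptions for illustration, and the sketch uses the add_(tensor, alpha=...) form, which as far as I know is the preferred spelling in newer PyTorch versions.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.1

model_a = nn.Linear(3, 1)         # updated by hand
model_b = copy.deepcopy(model_a)  # updated by torch.optim.SGD
x, y = torch.randn(5, 3), torch.randn(5, 1)

for model in (model_a, model_b):
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()

# Manual update: p <- p - lr * grad
with torch.no_grad():
    for p in model_a.parameters():
        p.add_(p.grad, alpha=-lr)

# The same update through the optimizer (plain SGD, no momentum).
torch.optim.SGD(model_b.parameters(), lr=lr).step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
```

Doing both on the same model would apply the gradients twice per iteration, which is why the two should not be mixed.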
Note that clip_grad_norm_ modifies the gradient after the entire backpropagation has taken place. In the RNN context it is common to restrict the gradient that is being backpropagated during the calculation. This is described e.g. in Alex Graves’ famous RNN paper.
To do the latter, you typically use register_hook on the inputs or outputs of certain operations, e.g. with lambda x: x.clamp(-10,10) to do element-wise clipping.
For a practical example, you could search for register_hook in my Graves handwriting generation notebook.
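A tiny self-contained sketch of the register_hook approach, with made-up tensors rather than a real RNN:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
h = x * 100.0  # intermediate value in the graph

# Clamp the gradient flowing back through h, element-wise, to [-10, 10].
h.register_hook(lambda grad: grad.clamp(-10, 10))

loss = (h ** 2).sum()
loss.backward()
# d(loss)/dh would be [200, 400, 600]; the hook clamps it to [10, 10, 10],
# so x.grad is 100 * [10, 10, 10] = [1000, 1000, 1000].
```

Here the clipping happens during the backward pass, so everything upstream of h already sees the clamped gradient, in contrast to clip_grad_norm_, which only touches the final parameter gradients.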
No reason: you can certainly use optimizer.step() and it will most likely lead to a better solution since the optimizer will update the parameters in a more sophisticated way (e.g. using momentum).
My bad, I thought what you suggest is that if you do gradient clipping, then you should (for some reason) use custom updates instead of optimizer.step(). Now I got it, you meant that if you use custom updates, then you should not use optimizer.step() (to avoid mixing custom and auto updates). Makes sense!
You need to use both optimizer.step and clip, right? Because optimizer.step calculates the gradient and then you want to clip those gradients to prevent vanishing on the next training step?