Proper way to do gradient clipping?

Is there a proper way to do gradient clipping, for example, with Adam? It seems that the value of Variable.grad.data should be manipulated (clipped) before calling the optimizer.step() method. I think the value of Variable.grad.data can be modified in-place to do gradient clipping. Is that safe to do?

Also, is there a reason that the Autograd RNN cells have separate biases for input-to-hidden and hidden-to-hidden? I think this is redundant and adds some overhead.

14 Likes

You can safely modify Variable.grad.data in-place after the backward pass finishes. For example, see how it’s done in the language modelling example.
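For instance, a minimal sketch of in-place clipping between the backward pass and the optimizer step (model, optimizer, loss, and clip are assumed to come from a typical training loop):

loss.backward()                      # gradients are populated here
for p in model.parameters():
    p.grad.data.clamp_(-clip, clip)  # in-place, element-wise clipping
optimizer.step()                     # Adam (or any optimizer) then uses the clipped gradients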

The reason for that is that it gives a nice user-facing API where both weight tensors are exposed. It also opens up the possibility of doing a batched matrix multiply on the inputs for all steps, and then only applying the hidden-to-hidden weights inside the loop (that optimization isn’t added there yet). If you measure the overhead and show us that it can be implemented in a clean and fast way, we’ll happily accept a PR or change it.
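To illustrate the batched input projection idea, here is a rough sketch with made-up shapes and names, not the library’s actual code:

import torch

seq_len, batch, input_size, hidden_size = 35, 20, 100, 200
x = torch.randn(seq_len, batch, input_size)
w_ih = torch.randn(4 * hidden_size, input_size)  # input-to-hidden weights for the four LSTM gates
b_ih = torch.randn(4 * hidden_size)

# one big matmul over every timestep at once...
gates_i = (x.view(-1, input_size) @ w_ih.t() + b_ih).view(seq_len, batch, -1)

# ...so the per-step loop would only have to apply the hidden-to-hidden weights:
# for t in range(seq_len):
#     gates = gates_i[t] + h @ w_hh.t() + b_hh
#     ...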

6 Likes

I have tested nn.LSTM against a simple LSTM implementation and found almost no difference in performance. Maybe I overestimated the overhead of the extra addition based on a rough guess. Thank you!

If you’re running on a GPU you’ll also likely see great speedups from using the cuDNN LSTM implementation.

I tested on CPU and the difference was only a few milliseconds (for anyone who may try to implement an LSTM for benchmarking :slight_smile: ). I think the extra additions are insignificant compared to the other expensive computations, like the weight matrix multiplications, the nonlinear activation functions, or even the Python loop itself.

1 Like

Quick question about this @apaszke: is the Variable.grad.data that we should pass to our clip function part of the model object, or (if we use a different optimizer) of the optimizer object?

In other words, does the optimizer itself call backward()? In that case, should the code below pass the optimizer to the clip function instead?

optimizer.zero_grad()
output, hidden = model(data, hidden)
loss = criterion(output.view(-1, ntokens), targets)
loss.backward()
clipped_lr = lr * clip_gradient(model, clip)
for p in model.parameters():
    p.data.add_(-clipped_lr, p.grad.data)

optimizer.step()

I’m sorry, but I don’t understand the question. The optimizer never calls backward() itself, unless you give it a callable argument (see the torch.optim docs for more details on that). BTW, you might want to use torch.nn.utils.clip_grad_norm now.
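For reference, a minimal usage sketch of that utility (clip is an assumed threshold; model, optimizer, and loss come from a typical training loop like the one above):

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm(model.parameters(), clip)  # rescales all gradients in place
optimizer.step()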

8 Likes

Maybe I’m doing something wrong here, but using gradient clipping like

nn.utils.clip_grad_norm(model.parameters(), clip)
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

makes my network train much slower than with optimizer.step().

Here’s what it looks like with gradient clipping, with clip=5:

Epoch: 1/10... Step: 10... Loss: 4.4288
Epoch: 1/10... Step: 20... Loss: 4.4274
Epoch: 1/10... Step: 30... Loss: 4.4259
Epoch: 1/10... Step: 40... Loss: 4.4250
Epoch: 1/10... Step: 50... Loss: 4.4237
Epoch: 1/10... Step: 60... Loss: 4.4223
Epoch: 1/10... Step: 70... Loss: 4.4209
Epoch: 1/10... Step: 80... Loss: 4.4193
Epoch: 1/10... Step: 90... Loss: 4.4188
Epoch: 1/10... Step: 100... Loss: 4.4174

And without gradient clipping, everything else equal:

Epoch: 1/10... Step: 10... Loss: 3.2837
Epoch: 1/10... Step: 20... Loss: 3.1901
Epoch: 1/10... Step: 30... Loss: 3.1512
Epoch: 1/10... Step: 40... Loss: 3.1296
Epoch: 1/10... Step: 50... Loss: 3.1170
Epoch: 1/10... Step: 60... Loss: 3.0758
Epoch: 1/10... Step: 70... Loss: 2.9787
Epoch: 1/10... Step: 80... Loss: 2.9104
Epoch: 1/10... Step: 90... Loss: 2.8271
Epoch: 1/10... Step: 100... Loss: 2.6813

There is probably something I don’t understand, but I’m just switching out those two bits of code.

Maybe you’re clipping them to very small values; that’s a possible effect.
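One way to check is to look at the value clip_grad_norm returns, which is the total gradient norm before clipping (a quick sketch; model and clip are the ones from the snippet you posted):

total_norm = nn.utils.clip_grad_norm(model.parameters(), clip)
print('gradient norm before clipping:', total_norm)  # if this is far above clip, the update shrinks a lot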

1 Like

The one that comes with nn.utils clips in proportion to the magnitude of the gradients. Thus you’d want to make sure the clip value is not too small for your particular model, as Adam said (I think :p). The old-fashioned way of clipping/clamping is

def gradClamp(parameters, clip=5):
    for p in parameters:
        p.grad.data.clamp_(-clip, clip)  # element-wise clamp to [-clip, clip]
2 Likes

for people trying to just get an answer quickly:

torch.nn.utils.clip_grad_norm(mdl_sgd.parameters(),clip)

or with in-place clamp:

W.grad.data.clamp_(-clip,clip)


40 Likes

I thought nn.utils.clip_grad_norm(model.parameters(), clip) was supposed to finish the job on its own.

What is

for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

for?

Can someone give a more explicit explanation? Is it because after I use gradient clipping, I may not use the Adam optimizer?

5 Likes

@ntubertchen
Hi,
Use torch.nn.utils.clip_grad_norm to keep the gradients within a specific range (clip). In RNNs the gradients tend to grow very large (this is called ‘the exploding gradient problem’), and clipping them helps to prevent this from happening. It is probably helpful to look at the implementation, because it teaches us that (a rough sketch follows the list below):

  1. “The norm is computed over all gradients together, as if they were concatenated into a single vector.”
  2. You can control the norm type (lp-norm, with p defaulting to 2; or the L-inf norm).
  3. All of the gradient coefficients are multiplied by the same clip_coef.
  4. clip_grad_norm is invoked after all of the gradients have been computed, i.e. between loss.backward() and optimizer.step(). So during loss.backward() the gradients that are propagated backwards are not clipped; they are only clipped once the backward pass completes and clip_grad_norm() is invoked. optimizer.step() will then use the clipped gradients.
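This isn’t the library’s actual code, but a rough sketch of what clip_grad_norm does under the hood (the function name and the small epsilon are illustrative):

def clip_grad_norm_sketch(parameters, max_norm, norm_type=2):
    parameters = list(parameters)
    # norm over all gradients together, as if concatenated into a single vector
    total_norm = 0.0
    for p in parameters:
        if p.grad is not None:
            total_norm += p.grad.data.norm(norm_type) ** norm_type
    total_norm = total_norm ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in parameters:
            if p.grad is not None:
                p.grad.data.mul_(clip_coef)  # every gradient is scaled by the same clip_coef
    return total_norm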

Regarding the code you ask about:

for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

This iterates across all of the model.parameters() and performs an in-place multiply-add on each of the parameter tensors (p).
p.data.add_ is functionally equivalent to:

p.data = p.data + (-lr * p.grad.data)

In other words, this performs a similar function to optimizer.step(), using the gradients to update the model parameters, but without the extra sophistication of a torch.optim.Optimizer. If you use the above code, then you should not use an optimizer (and vice versa).
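As a sketch, the two alternatives look like this (lr and clip are assumed hyperparameters; pick one update path, not both):

# option A: manual SGD-style update, no optimizer object
nn.utils.clip_grad_norm(model.parameters(), clip)
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

# option B: clip, then let the optimizer update the parameters
nn.utils.clip_grad_norm(model.parameters(), clip)
optimizer.step()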

Cheers,
Neta

16 Likes

Note that clip_grad_norm_ modifies the gradient after the entire backpropagation has taken place. In the RNN context it is common to restrict the gradient that is being backpropagated during the calculation. This is described e.g. in Alex Graves’ famous RNN paper.
To do the latter, you typically use register_hook on the inputs or outputs of certain operations, e.g. with lambda x: x.clamp(-10,10) to do element-wise clipping.
For a practical example, you could search for register_hook in my Graves handwriting generation notebook.
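A minimal sketch of that hook-based approach (the module, sizes, and clipping range here are illustrative, not taken from the notebook):

import torch
import torch.nn as nn

cell = nn.LSTMCell(10, 20)              # illustrative sizes
x = torch.randn(16, 10)
h = torch.zeros(16, 20)
c = torch.zeros(16, 20)

for _ in range(5):                      # unrolled RNN steps
    h, c = cell(x, (h, c))
    # clip the gradient element-wise as it flows back through h during backward()
    h.register_hook(lambda grad: grad.clamp(-10, 10))

loss = h.sum()
loss.backward()                         # the hooks fire during this backward pass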

Best regards

Thomas

22 Likes

@tom Awesome info. Thanks!

Thanks. Where does this go in relation to forward and backward propagation?

2 Likes

What’s the reason for not using the optimizer.step after clipping the gradients?

1 Like

No reason: you can certainly use optimizer.step() and it will most likely lead to a better solution since the optimizer will update the parameters in a more sophisticated way (e.g. using momentum).

1 Like

My bad, I thought you were suggesting that if you do gradient clipping, then you should (for some reason) use custom updates instead of optimizer.step(). Now I get it: you meant that if you use custom updates, then you should not use optimizer.step() (to avoid mixing custom and automatic updates). Makes sense!

1 Like

You need to use both optimizer.step and clipping, right? Because optimizer.step calculates the gradient, and then you want to clip those gradients to prevent vanishing on the next training step?