Gradient Clipping makes training slow

Hi all,

I am new to PyTorch, but I have been using Torch for more than a year.

I implemented an autoencoder with GRUs as the encoder and decoder, and it is a multi-GPU implementation. When I add gradient clipping, training becomes roughly 4 times slower than without it.

Is this normal, or is it an issue with my implementation?

Thank you so much for your time.

I don’t think gradient clipping should be time-consuming. It is just a few tensor operations, which should be fast.

Yes, I agree.

I guess the issue was caused by calling the following line in every iteration:

torch.nn.utils.clip_grad_norm(model.parameters(), config.clip)

Instead, I tried another way.

First, call

params = model.parameters()

before training, and then call

torch.nn.utils.clip_grad_norm(params, config.clip)

in every iteration. Now the training speed is reasonable compared to my Torch implementation.


The other thing slowing down training was that my LSTM model is way too small, so the multi-GPU setting didn’t help. I have now switched to single-GPU mode, and it runs super fast.

Why does this seem to cause such a drastic change in performance? Is it because the iterators are recreated or something?

But code written like that won’t actually have any effect on the parameters’ gradients.

Expanding on @imy’s reply…

model.parameters() returns a generator, which can only be iterated over once before it is exhausted, so it has to be re-created every time you want to iterate over the parameters. The proposed code is faster because it only clips the gradients once, on the first call to torch.nn.utils.clip_grad_norm. Every later call receives an exhausted generator, so it effectively does nothing, and that is why it is faster.
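
To make the exhausted-generator effect concrete, here is a minimal pure-Python sketch. It does not use PyTorch: ToyParam, ToyModel, toy_clip_grad_norm and set_grads are hypothetical stand-ins I made up for illustration, and they only mimic the two relevant behaviours (parameters() hands back a fresh generator on each call, and the clipping helper iterates over whatever it is given).

```python
import math


class ToyParam:
    """Hypothetical stand-in for a parameter that carries a gradient."""

    def __init__(self, grad):
        self.grad = list(grad)


class ToyModel:
    """Hypothetical model; parameters() builds a new generator on every call."""

    def __init__(self):
        self._params = [ToyParam([3.0, 4.0]), ToyParam([12.0])]

    def parameters(self):
        for p in self._params:
            yield p


def toy_clip_grad_norm(parameters, max_norm):
    """Rescale gradients in place so their total L2 norm is at most max_norm.

    Returns the total norm measured before clipping.
    """
    params = list(parameters)  # this consumes a generator
    total = math.sqrt(sum(g * g for p in params for g in p.grad))
    if total > max_norm:
        scale = max_norm / total
        for p in params:
            p.grad = [g * scale for g in p.grad]
    return total


def set_grads(model, grads):
    """Simulate a backward pass by overwriting every gradient."""
    for p, g in zip(model.parameters(), grads):
        p.grad = list(g)


fresh_grads = [[3.0, 4.0], [12.0]]  # total L2 norm is 13

model = ToyModel()
stale = model.parameters()  # created once, before the "training loop"

first = toy_clip_grad_norm(stale, 1.0)   # sees all parameters and clips them
set_grads(model, fresh_grads)            # next iteration produces new gradients
second = toy_clip_grad_norm(stale, 1.0)  # generator is exhausted: nothing is clipped
norm_after_stale = math.sqrt(sum(g * g for p in model.parameters() for g in p.grad))

third = toy_clip_grad_norm(model.parameters(), 1.0)  # fresh generator: clips again
norm_after_fresh = math.sqrt(sum(g * g for p in model.parameters() for g in p.grad))

print(first, second, third)
print(norm_after_stale, norm_after_fresh)
```

In this sketch the second call reports a norm of 0 and leaves the gradients at their unclipped size, which matches the "fast but does nothing" behaviour described above. If you want to build the collection only once, materializing it with list(model.parameters()) before the loop gives you something that can be iterated repeatedly; otherwise, call model.parameters() inside the loop as in the original code.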
