I am new to PyTorch, but I have been using Torch for more than a year now.
Basically, I implemented an autoencoder with a GRU encoder and decoder, running on multiple GPUs. When I add gradient clipping to training, training slows down noticeably, roughly 4x slower than without gradient clipping.
Is this normal, or is it an issue with my implementation?
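For reference, here is a minimal sketch of where gradient clipping typically sits in a PyTorch training step (the model, shapes, and max_norm value are hypothetical, since the original post doesn't show code). Clipping walks over every parameter's gradient each step, so it adds some per-step cost, though not normally a 4x slowdown:

```python
import torch
import torch.nn as nn

# Hypothetical toy GRU model and random data, just to show the pattern.
model = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(4, 10, 8)  # (batch, seq_len, features)
out, _ = model(x)
loss = out.pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Clip after backward() and before step(); in current PyTorch the
# in-place variant is torch.nn.utils.clip_grad_norm_.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```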
The other thing slowing down training was that my LSTM model is far too small for the multi-GPU setup to help. I switched to single-GPU mode, and now it runs much faster.
model.parameters() returns a generator, which can only be iterated over once before it is exhausted; it has to be re-created every time you want to iterate over it. The proposed code is faster because it only clips gradients once, on the first call to torch.nn.utils.clip_grad_norm. Successive calls operate on an exhausted iterator, which effectively does nothing, and that is why they appear faster.
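The exhaustion behavior is easy to demonstrate with a toy model (using nn.Linear here just for illustration; the clipping function is shown under its current in-place name, clip_grad_norm_):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

params = model.parameters()   # a generator
first_pass = list(params)     # consumes it: weight and bias
second_pass = list(params)    # already exhausted: empty list

print(len(first_pass))   # 2
print(len(second_pass))  # 0

# Correct pattern: call model.parameters() afresh on every step,
# so the clipping call actually sees the parameters each time.
loss = model(torch.randn(3, 4)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

So if you stored model.parameters() in a variable once and passed that same variable to the clipping call on every iteration, only the first call would do any work.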