I am new to PyTorch, but I have been using Torch for more than a year now.
Basically, I implemented an autoencoder with a GRU encoder and decoder, running on multiple GPUs. When I add gradient clipping to training, training slows down noticeably, roughly 4x slower than without gradient clipping.
Is this normal, or is it an issue with my implementation?
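For reference, here is a minimal sketch of where gradient clipping typically sits in a PyTorch training step (the model, shapes, and max_norm value are hypothetical, since the original post doesn't show code). Clipping walks over every parameter's gradient each step, so it adds some per-step cost, though not normally a 4x slowdown:

```python
import torch
import torch.nn as nn

# Hypothetical toy GRU model and random data, just to show the pattern.
model = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(4, 10, 8)  # (batch, seq_len, features)
out, _ = model(x)
loss = out.pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Clip after backward() and before step(); in current PyTorch the
# in-place variant is torch.nn.utils.clip_grad_norm_.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```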
The other thing slowing down training was that my LSTM model is far too small for the multi-GPU setup to help. I switched to single-GPU mode, and now it runs much faster.
model.parameters() returns a generator, which can only be iterated over once before it is exhausted; it has to be re-created every time you want to iterate over it. The proposed code is faster because it only clips gradients once, on the first call to torch.nn.utils.clip_grad_norm. Successive calls operate on an exhausted iterator, which effectively does nothing, and that is why they appear faster.
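The exhaustion behavior is easy to demonstrate with a toy model (using nn.Linear here just for illustration; the clipping function is shown under its current in-place name, clip_grad_norm_):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

params = model.parameters()   # a generator
first_pass = list(params)     # consumes it: weight and bias
second_pass = list(params)    # already exhausted: empty list

print(len(first_pass))   # 2
print(len(second_pass))  # 0

# Correct pattern: call model.parameters() afresh on every step,
# so the clipping call actually sees the parameters each time.
loss = model(torch.randn(3, 4)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

So if you stored model.parameters() in a variable once and passed that same variable to the clipping call on every iteration, only the first call would do any work.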