The cutoff threshold for gradient clipping is typically set based on the average norm of the gradient over one pass over the data. I would therefore like to compute this average norm to find a suitable clipping value for my model. How can this be done in PyTorch?

Another quick question: I have seen the following in the language modeling example:

# `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

If clip_grad_norm is already applied to model.parameters(), why do we need the for loop?

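The for loop is the gradient descent update, which the example implements manually instead of using an optimizer. clip_grad_norm only rescales the gradients stored on the parameters; it does not change the parameters themselves. The loop then subtracts each parameter's (clipped) gradient times the learning rate from that parameter.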
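
To make the two steps concrete, here is a minimal sketch, not the code from the example itself. The tiny nn.Linear model, the random data, and the lr and clip values are all made up for illustration. It uses torch.nn.utils.clip_grad_norm_ (the in-place spelling with a trailing underscore, which replaces the deprecated clip_grad_norm) and the add_(tensor, alpha=...) form of the update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, batch, and hyperparameters.
model = nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
lr, clip = 0.1, 0.25

model.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

before = [p.detach().clone() for p in model.parameters()]

# Step 1: rescale the gradients in place so that their total L2 norm is
# at most `clip`. The parameters themselves are not touched here.
torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
params_unchanged = all(
    torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)
clipped_norm = torch.sqrt(
    sum(p.grad.pow(2).sum() for p in model.parameters())
).item()

# Step 2: the manual gradient descent update, p <- p - lr * grad.
with torch.no_grad():
    for p in model.parameters():
        p.add_(p.grad, alpha=-lr)

params_updated = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)
```

In practice the loop in step 2 is what an optimizer such as torch.optim.SGD does for you in optimizer.step(), so the clipping call would simply go between loss.backward() and optimizer.step().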

To your first question: if you are referring to the clipping of Pascanu et al., which is based on the norm of the gradient, then torch.nn.utils.clip_grad_norm does that for you. The clipping threshold is usually tuned as a hyperparameter, since there is no way to know in advance what the gradient norms will be over the course of training.
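
If you still want to measure the average gradient norm over one pass as a starting point for that tuning, one possible approach is to record the global L2 norm after each backward() call and average the values. The sketch below assumes a toy nn.Linear model and random batches; grad_global_norm is a hypothetical helper written for this post, not a PyTorch function. It treats all gradients as one flat vector, which is also how clip_grad_norm_ defines the norm it clips (and that function returns the total norm it computed, so you could log its return value instead).

```python
import torch
import torch.nn as nn


def grad_global_norm(parameters):
    """L2 norm of all gradients, viewed as a single flat vector.

    Hypothetical helper for illustration; skips parameters without a gradient.
    """
    norms = [p.grad.detach().norm(2) for p in parameters if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item()


torch.manual_seed(0)

# Toy model and data standing in for the real training setup.
model = nn.Linear(4, 1)
data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]

norms = []
for x, y in data:
    model.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Record the norm before any clipping or parameter update.
    norms.append(grad_global_norm(model.parameters()))
    # The clipping call and the parameter update would go here.

avg_norm = sum(norms) / len(norms)
print(f"average gradient norm over one pass: {avg_norm:.4f}")
```

Keep in mind that the norms usually drift as training progresses, so an average taken over one early epoch is only a rough guide and the threshold may still need tuning.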