About torch.nn.utils.clip_grad

jef · November 30, 2017, 7:30pm

I don't fully understand torch.nn.utils.clip_grad. I was looking at the following code:
http://pytorch.org/docs/master/_modules/torch/nn/utils/clip_grad.html#clip_grad_norm

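I thought max_norm in this function was the maximum norm allowed for each parameter's gradient. Instead, the function computes a single total norm over the gradients of all parameters (the square root of the sum of their squared norms).
Suppose there are two parameters with identical gradients, (3, 4) and (3, 4), each with an L2 norm of 5, and max_norm is set to 5.
I expected the function to leave these gradients unchanged, but it rescaled them.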

Here total_norm is 50 ** 0.5, which is approximately 7.07, so each gradient becomes (3*5/7.07, 4*5/7.07) = (2.12, 2.83).
The result therefore depends on the number of parameters, because every parameter contributes to total_norm.
How is this function normally used, and how should max_norm be chosen?
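
To make the arithmetic above concrete, here is a minimal pure-Python sketch of the scaling rule as I read it from the linked source. The helper name clip_by_global_norm is hypothetical, and the small eps term is my assumption about how the division is guarded; this is not the library's actual code.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by one shared coefficient (hypothetical helper)."""
    # One norm over all gradient entries, as if they formed a single vector.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    clip_coef = max_norm / (total_norm + eps)
    # Only shrink gradients; never scale them up.
    if clip_coef < 1:
        grads = [[g * clip_coef for g in grad] for grad in grads]
    return total_norm, grads

# Two gradients of (3, 4), each with L2 norm 5, and max_norm = 5.
total_norm, clipped = clip_by_global_norm([[3.0, 4.0], [3.0, 4.0]], max_norm=5.0)
print(round(total_norm, 2))
print([[round(g, 2) for g in grad] for grad in clipped])
```

With a single (3, 4) gradient the total norm would be 5 and the values would stay essentially unchanged, which is why the outcome depends on how many parameters are passed in.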

I found only one example that uses this function:

github.com/pytorch/examples/blob/master/word_language_model/main.py#L162

data, targets = get_batch(train_data, i)
# Starting each batch, we detach the hidden state from how it was previously produced.
# If we didn't, the model would try backpropagating all the way to start of the dataset.
hidden = repackage_hidden(hidden)
model.zero_grad()
output, hidden = model(data, hidden)
loss = criterion(output.view(-1, ntokens), targets)
loss.backward()


# `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)


total_loss += loss.data


if batch % args.log_interval == 0 and batch > 0:
    cur_loss = total_loss[0] / args.log_interval
    elapsed = time.time() - start_time
    print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
            'loss {:5.2f} | ppl {:8.2f}'.format(

Vijay_Dubey · January 6, 2018, 7:03am

Could someone please reply to this? I have a similar question.

Thanks

jef · January 16, 2018, 11:48pm

Still hoping for an answer.

iwtw · June 15, 2018, 7:34am

I found the explanation in the docs:
“The norm is computed over all gradients together, as if they were concatenated into a single vector.”
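
A small sketch that checks this against the (3, 4) example from the first post. It assumes a PyTorch release where the function is named clip_grad_norm_ (the in-place name that replaced clip_grad_norm in later versions). The two toy parameters and the set_grads helper are made up for illustration.

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Two toy parameters (hypothetical), each given the gradient (3, 4).
params = [torch.nn.Parameter(torch.zeros(2)) for _ in range(2)]

def set_grads():
    for p in params:
        p.grad = torch.tensor([3.0, 4.0])

# Global clipping: one norm over all gradients, as the docs describe.
set_grads()
concat_norm = torch.cat([p.grad.flatten() for p in params]).norm(2)
total_norm = clip_grad_norm_(params, max_norm=5.0)  # returns the pre-clip norm
global_clipped = params[0].grad.clone()  # roughly (2.12, 2.83)

# Per-parameter clipping, the behavior the first post expected:
# call the function once per parameter so each gets its own norm.
set_grads()
for p in params:
    clip_grad_norm_([p], max_norm=5.0)
per_param_clipped = params[0].grad.clone()  # stays close to (3, 4)

print(float(concat_norm), float(total_norm))
print(global_clipped, per_param_clipped)
```

So max_norm bounds the norm of the whole gradient vector, not of each parameter. If a per-parameter bound is what you want, looping as in the second half is one way to get it, though the word_language_model example linked above clips globally.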