I think you’re right.
So, in RNN optimization, does clipping the gradient of loss + L2 penalty make a big difference compared with clipping the gradient of the loss alone?
If it does, how should I implement the code so that the clipping covers loss + L2 penalty?
Many thanks.
I would remove the weight_decay argument to Adam and explicitly add the L2 penalty to the loss yourself:
for p in model.parameters():
    # squared L2 norm of each parameter tensor, scaled by the regularization weight
    loss = loss + options['reg'] * p.pow(2).sum()
loss.backward()
# clip the gradient of (data loss + L2 penalty); clip_grad_norm_ is the in-place,
# non-deprecated replacement for clip_grad_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), options['clip_gradient_norm'])
optimizer.step()
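If I remember correctly, Adam's weight_decay term is added to the gradients inside optimizer.step(), i.e. after clip_grad_norm_ has already run, so with weight_decay the clipping only ever sees the data-loss gradient; folding the penalty into the loss as above means the clipped norm includes it, which is where a difference could show up. Here is a minimal self-contained sketch of one training step with this pattern, assuming a toy nn.RNN regression setup and made-up hyperparameters in options:

import torch
import torch.nn as nn

options = {'reg': 1e-4, 'clip_gradient_norm': 5.0}  # assumed hyperparameters

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)  # note: no weight_decay here
criterion = nn.MSELoss()

x = torch.randn(8, 15, 10)  # (batch, seq_len, features)
y = torch.randn(8, 1)

optimizer.zero_grad()
out, _ = rnn(x)
loss = criterion(head(out[:, -1]), y)

# explicit L2 penalty, so the subsequent clipping sees data loss + penalty
for p in params:
    loss = loss + options['reg'] * p.pow(2).sum()

loss.backward()
torch.nn.utils.clip_grad_norm_(params, options['clip_gradient_norm'])
optimizer.step()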