When I add a loss weight or some new losses to a model (keeping the original losses unchanged), I find that the norm of the model's gradient increases. For example, I measure it with the following code:

```
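# Run after loss.backward(): global L2 norm over all parameter gradients.
total_norm = 0.0
for p in net.parameters():
    if p.grad is None:  # skip parameters that received no gradient
        continue
    param_norm = p.grad.detach().norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(total_norm)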
```

I wonder whether this affects training stability or performance. Should I decrease the learning rate, or scale the loss weights to match the original gradient norm? Can anyone help clarify the relationship between the gradient (or its norm), the learning rate, and the step size? Thanks a lot!

The gradient norm is related to the current goodness of fit, so optimizer adjustments may not be needed. However, parameter sharing during multi-task training can sometimes be problematic by itself.

Thanks for your reply! But I’m not sure about the meaning of “parameter sharing during multi-task training”. Here, I just add some new losses or increase the loss weight in a single-task model. The gradient norm increases. I’m not sure about the effect and the underlying meaning of gradient norm.

two losses = two objectives = two probability distributions being modeled = two “tasks”

It is usually fine if network capacity is big enough, tasks are related (“shared” feature space representations are possible) and value magnitudes are similar.
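One way to check whether the magnitudes are similar is to look at the gradient norm each loss produces on its own. A minimal sketch, assuming a toy `nn.Linear` model and two made-up losses; `grad_norm_of`, `loss_main`, and `loss_aux` are hypothetical names for illustration, not anything from your code:

```python
import torch
import torch.nn as nn

def grad_norm_of(loss, params):
    """L2 norm of d(loss)/d(params), computed without writing to p.grad."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    sq = 0.0
    for g in grads:
        if g is not None:  # parameters unused by this loss have no gradient
            sq += g.pow(2).sum().item()
    return sq ** 0.5

torch.manual_seed(0)
net = nn.Linear(4, 2)                        # toy stand-in for the real model
x, y = torch.randn(8, 4), torch.randn(8, 2)  # toy batch
out = net(x)
params = list(net.parameters())

loss_main = nn.functional.mse_loss(out, y)   # hypothetical original loss
loss_aux = out.abs().mean()                  # hypothetical added loss

norm_main = grad_norm_of(loss_main, params)
norm_aux = grad_norm_of(loss_aux, params)
norm_sum = grad_norm_of(loss_main + loss_aux, params)
print(norm_main, norm_aux, norm_sum)
```

The combined norm can never exceed the sum of the two individual norms (triangle inequality). If one of them is much larger than the other, that loss will likely dominate the update, which is when rescaling its weight is worth considering.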

Your network is not yet trained for the introduced objective, so gradients increase.
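On the gradient norm vs learning rate vs step size question: for plain SGD (no momentum, no weight decay) the parameter update is `lr * grad`, so the step norm is `lr` times the gradient norm, and multiplying the loss by `k` has the same effect as multiplying `lr` by `k`. A rough sketch on a toy model, where `sgd_step_norm` is a hypothetical helper written just for this illustration:

```python
import copy
import torch
import torch.nn as nn

def sgd_step_norm(net, x, y, loss_weight, lr):
    """Take one plain-SGD step on a copy; return (gradient norm, update norm)."""
    net = copy.deepcopy(net)  # leave the caller's model untouched
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    before = [p.detach().clone() for p in net.parameters()]
    opt.zero_grad()
    loss = loss_weight * nn.functional.mse_loss(net(x), y)
    loss.backward()
    grad_norm = sum(p.grad.pow(2).sum().item() for p in net.parameters()) ** 0.5
    opt.step()
    step_norm = sum((p.detach() - b).pow(2).sum().item()
                    for p, b in zip(net.parameters(), before)) ** 0.5
    return grad_norm, step_norm

torch.manual_seed(0)
net = nn.Linear(4, 2)                        # toy stand-in for the real model
x, y = torch.randn(8, 4), torch.randn(8, 2)  # toy batch

g1, s1 = sgd_step_norm(net, x, y, loss_weight=1.0, lr=0.1)    # baseline
g2, s2 = sgd_step_norm(net, x, y, loss_weight=4.0, lr=0.1)    # 4x loss weight
g3, s3 = sgd_step_norm(net, x, y, loss_weight=4.0, lr=0.025)  # 4x weight, lr / 4
print(g1, s1)
print(g2, s2)
print(g3, s3)
```

Here the 4x loss weight gives a 4x gradient norm and a 4x step, and dividing `lr` by 4 brings the step back to the baseline. This only holds exactly for vanilla SGD. Adaptive optimizers such as Adam normalize the update by running gradient statistics, so a constant loss scale matters much less there.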