Seeking Advice on Total Grad Norm Values

Hello everyone,

I’m currently working on a deep learning project and have a question about the total gradient norm, i.e. the value returned by `total_grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)`. I’ve noticed that typical total grad norm values are around 0.2, or at least below 1.0, even in larger models. This seems to be why people often set the clipping threshold to 0.5 or 1.0 during training.
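
For reference, here’s a minimal sketch of how I’m reading that value (the model, optimizer, and data are just toy placeholders for illustration). Note that `clip_grad_norm_` returns the total norm computed *before* clipping:

```python
import torch
import torch.nn as nn

# Toy model and data, purely for illustration.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Returns the total L2 norm over all parameter gradients,
# measured before any clipping is applied.
total_grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"total grad norm (pre-clipping): {total_grad_norm:.4f}")

optimizer.step()
```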

However, in my own model I’ve observed grad norm values in the hundreds during training, and surprisingly the model still performs well. Intuitively, the grad norm should grow with the number of parameters, so I’m curious why it usually stays below 1.0 in most cases.
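
To make that intuition concrete, here’s a tiny hypothetical illustration (not my actual model): if every gradient entry has roughly the same magnitude g, the L2 norm of N entries grows like g·√N, so parameter count alone raises the norm only slowly:

```python
import torch

# If each of N gradient entries has magnitude 0.01, the total L2 norm
# is 0.01 * sqrt(N): more parameters alone don't produce norms in the
# hundreds unless the per-entry gradients are also large.
for n in (1_000, 1_000_000):
    grad = torch.full((n,), 0.01)      # every entry has magnitude 0.01
    print(n, grad.norm(p=2).item())    # ~0.32 for 1e3, ~10.0 for 1e6
```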

I would greatly appreciate any insights, advice, or explanations that could help me better understand these grad norm values and their implications for the training process. Thank you in advance for your help!