How to choose between clip_grad_norm and BatchNorm2d

  • clip_grad_norm_ performs gradient clipping: it rescales gradients whose norm exceeds a threshold, in order to mitigate exploding gradients.
  • BatchNorm2d applies Batch Normalization: it normalizes a layer's activations across the batch, which stabilizes and speeds up training (and, as a side effect, keeps gradient magnitudes better behaved).

When should we choose clip_grad_norm_, and when should we prefer BatchNorm2d?

I think in practice you will find that:

  • BatchNorm2d is often used between the convolution and activation in ConvNet layers (e.g. the layers labelled BN in a standard ResNet block)
  • gradient clipping is often used in RNN models (such as LSTMs), because the deep recurrent structure, unrolled over many time steps, can cause gradients to blow up
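To make the second point concrete, here is a minimal sketch of where clip_grad_norm_ sits in a training step (the LSTM, the dimensions, and the loss are illustrative, not from any particular model):

```python
import torch
import torch.nn as nn

# Toy recurrent model; sizes are arbitrary for illustration.
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, 8)        # (batch, seq_len, features)
target = torch.randn(4, 10, 16)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients in place so their combined L2 norm is at most
# max_norm; returns the total norm measured *before* clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Note that clipping happens between backward() and step(): the gradients are already computed, and we shrink them before the optimizer uses them.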

I don’t think these are mutually exclusive. For example, in an OCR model built as a ConvNet stacked on top of an RNN, you might use BatchNorm2d inside the ConvNet and apply gradient clipping for the sake of the RNN.
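A rough sketch of what such a hybrid looks like (the architecture, names, and sizes below are hypothetical, chosen only to show BatchNorm2d living in the conv stage while clipping targets the recurrent parameters):

```python
import torch
import torch.nn as nn

class TinyOCRNet(nn.Module):
    """Hypothetical ConvNet-on-RNN stack: BN in the conv part only."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),   # normalizes the conv activations
            nn.ReLU(),
            nn.MaxPool2d(2),      # 32x64 input -> 16x32 feature map
        )
        # Each image column (width position) becomes one timestep.
        self.rnn = nn.LSTM(input_size=16 * 16, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, 10)

    def forward(self, x):                 # x: (B, 1, 32, 64)
        f = self.conv(x)                  # (B, 16, 16, 32)
        B, C, H, W = f.shape
        f = f.permute(0, 3, 1, 2).reshape(B, W, C * H)  # sequence over width
        out, _ = self.rnn(f)
        return self.head(out)             # (B, W, 10)

model = TinyOCRNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(2, 1, 32, 64)
logits = model(x)
loss = logits.pow(2).mean()               # stand-in loss
opt.zero_grad()
loss.backward()
# Clip only the recurrent parameters, where blow-ups tend to occur.
nn.utils.clip_grad_norm_(model.rnn.parameters(), max_norm=5.0)
opt.step()
```

BatchNorm2d is just another layer in the conv stack, while clipping is a training-loop operation, so the two compose without interfering.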

Hope this helps! (and curious if others have diverging opinions)