PyTorch implementation of normalisation layers for RNNs, LSTMs, GRUs

I am having a hard time finding a solid PyTorch implementation that adds normalization layers to recurrent networks. Does this require changing the implementation at the LSTMCell level, for example in the case of LSTM layers? And does normalization need to be applied at each time-step separately, or can it be applied to the entire output sequence at once?
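For concreteness, here is a minimal sketch of the two options I have in mind (my own naive attempt, class and variable names are mine); note that in the per-step version the normalized hidden state is fed back into the next step, so the two are not equivalent:

```python
import torch
import torch.nn as nn

# Option A: normalize at each time-step with an LSTMCell loop.
# The normalized hidden state is fed back into the next step.
class PerStepLayerNormLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):  # x: (seq_len, batch, input_size)
        h = x.new_zeros(x.size(1), self.cell.hidden_size)
        c = x.new_zeros(x.size(1), self.cell.hidden_size)
        outputs = []
        for t in range(x.size(0)):
            h, c = self.cell(x[t], (h, c))
            h = self.norm(h)  # normalize the hidden state per step
            outputs.append(h)
        return torch.stack(outputs)  # (seq_len, batch, hidden_size)

# Option B: run the whole sequence, then normalize the outputs at once.
# LayerNorm acts on the last (feature) dimension, so this normalizes
# every time-step, but the recurrence itself sees unnormalized states.
lstm = nn.LSTM(10, 20)
norm = nn.LayerNorm(20)
x = torch.randn(5, 3, 10)  # (seq_len, batch, features)
y, _ = lstm(x)
y = norm(y)
```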

I would also be very grateful if someone could share their experience on which normalization technique (e.g. batchnorm, layernorm, instancenorm) they found to work best for normalizing recurrent layer outputs in PyTorch.

That’s a big topic. There is “vertical” normalization (applied to a layer’s outputs) and “horizontal” normalization (applied to the recurrence h[t-1] -> h[t]), and many approaches have been tried. Basically, simple vertical batchnorm is problematic; LayerNorm(num_features) works, but the feature distortion it introduces may sometimes have adverse effects. You can check the Layer Normalization paper, as it addresses RNN normalization; the Recurrent Batch Normalization paper describes horizontal LSTM normalization.
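As a reference point, here is a rough sketch of the layer-normalized LSTM cell along the lines of the Layer Normalization paper, where normalization is applied to the gate pre-activations and to the cell state (class name and details are mine, and this is not a drop-in replacement for nn.LSTMCell):

```python
import torch
import torch.nn as nn

class LayerNormLSTMCell(nn.Module):
    """Sketch of a layer-normalized LSTM cell: LayerNorm is applied to
    the input-to-hidden and hidden-to-hidden pre-activations separately,
    and to the cell state before the output gate."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Biases are omitted here; LayerNorm's affine parameters take
        # over the role of the per-gate bias terms.
        self.w_ih = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.w_hh = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        self.ln_ih = nn.LayerNorm(4 * hidden_size)
        self.ln_hh = nn.LayerNorm(4 * hidden_size)
        self.ln_c = nn.LayerNorm(hidden_size)

    def forward(self, x, state):
        h, c = state
        # Normalize each pre-activation stream, then combine.
        gates = self.ln_ih(self.w_ih(x)) + self.ln_hh(self.w_hh(h))
        i, f, g, o = gates.chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(self.ln_c(c))
        return h, c
```

To use it over a sequence you would loop over time-steps manually, the same way you would with nn.LSTMCell, which is also why the fused cuDNN LSTM kernels can’t be used with this kind of cell.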