I am having a hard time finding a solid PyTorch implementation that adds normalization layers to recurrent networks. Does this require changes at the LSTMCell level, for example in the case of LSTM layers? And does normalization need to be applied at each time step separately, or can it be applied to the entire output sequence at once?
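To make the second option concrete, here is a minimal sketch of what I mean by normalizing the entire output sequence at once (the module name `LSTMWithLayerNorm` is just something I made up for illustration). My understanding is that because `nn.LayerNorm` normalizes over the last (hidden) dimension only, applying it to the full `(batch, seq, hidden)` output should be equivalent to normalizing each time step separately, without touching LSTMCell internals — please correct me if that's wrong:

```python
import torch
import torch.nn as nn

class LSTMWithLayerNorm(nn.Module):
    """Hypothetical sketch: LayerNorm over the full LSTM output sequence."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # Normalizes over the last dim only, so each time step is
        # normalized independently even when applied to the whole sequence.
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        out, _ = self.lstm(x)   # out: (batch, seq_len, hidden_size)
        return self.norm(out)

x = torch.randn(4, 10, 8)       # (batch=4, seq_len=10, features=8)
model = LSTMWithLayerNorm(8, 16)
y = model(x)                    # (4, 10, 16), each step normalized
```

What I'm unsure about is whether this post-hoc normalization captures the benefits people report, since it leaves the recurrent state transitions themselves unnormalized.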
I would also be very grateful if someone could share their experience with which normalization technique (e.g. BatchNorm, LayerNorm, InstanceNorm) they found works best for normalizing recurrent layer outputs in PyTorch.