I have a deep neural network for classifying videos, and I was wondering which layers should have batch normalization. Is this something that can only be found out empirically? When and where should I use batch normalization? What happens if I use more batch normalization layers than necessary?
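BatchNorm helps in training most deep networks. That said, if your batch size is small (less than about 8), training with BatchNorm is known to yield suboptimal results, because the mean and variance estimated from so few samples are noisy. In that case other normalization methods such as GroupNorm, which does not depend on the batch size, are helpful.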
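To make the small-batch point concrete, here is a minimal pure-Python sketch (not taken from any library) of the two normalizations on flat feature vectors. The helper names batch_norm and group_norm, the toy batches, and the choice of two groups are my own assumptions for illustration. Real layers also have learnable scale and shift parameters and operate on channels with spatial dimensions, which I leave out here.

```python
import math

EPS = 1e-5  # small constant for numerical stability (illustrative value)


def batch_norm(batch):
    """Normalize each feature using mean/variance taken over the batch."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + EPS)
    return out


def group_norm(batch, num_groups):
    """Normalize each sample on its own, over groups of its features."""
    out = []
    for sample in batch:
        size = len(sample) // num_groups  # assumes the features split evenly
        normed = []
        for g in range(num_groups):
            group = sample[g * size:(g + 1) * size]
            mean = sum(group) / size
            var = sum((v - mean) ** 2 for v in group) / size
            normed.extend((v - mean) / math.sqrt(var + EPS) for v in group)
        out.append(normed)
    return out


# The same sample placed in two different (tiny) batches.
sample = [1.0, 2.0, 3.0, 4.0]
batch_a = [sample, [0.0, 0.0, 0.0, 0.0]]
batch_b = [sample, [9.0, 7.0, 5.0, 3.0]]

bn_a = batch_norm(batch_a)[0]
bn_b = batch_norm(batch_b)[0]
gn_a = group_norm(batch_a, num_groups=2)[0]
gn_b = group_norm(batch_b, num_groups=2)[0]

print("BatchNorm, batch A:", bn_a)
print("BatchNorm, batch B:", bn_b)
print("GroupNorm, batch A:", gn_a)
print("GroupNorm, batch B:", gn_b)
```

With a batch of only two samples, the BatchNorm output for the same input changes completely depending on what else happens to be in the batch, while the GroupNorm output is identical in both cases. That dependence on the batch is what makes BatchNorm unreliable at very small batch sizes.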
In ResNets, BatchNorm is usually placed just before the activation layer (ReLU) or just before the point where the skip connection is added back in.
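As a rough illustration of that placement, here is a framework-free sketch of a residual block. It is a toy under stated assumptions: plain matrix multiplications (the hypothetical linear helper) stand in for convolutions, batch_norm has no learnable scale or shift, and the weights and inputs are random. Only the order of operations is the point. In practice you would use your framework's own convolution and BatchNorm layers.

```python
import math
import random


def linear(batch, weights):
    """Apply y = W x to every sample (a stand-in for a convolution)."""
    return [[sum(w * x for w, x in zip(row, sample)) for row in weights]
            for sample in batch]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch (no learnable scale/shift)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(c) / n for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n + eps)
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(sample, means, stds)]
            for sample in batch]


def relu(batch):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in sample] for sample in batch]


def residual_block(batch, w1, w2):
    out = linear(batch, w1)
    out = batch_norm(out)   # BatchNorm just before the activation
    out = relu(out)
    out = linear(out, w2)
    out = batch_norm(out)   # BatchNorm just before the skip addition
    out = [[o + x for o, x in zip(o_row, x_row)]
           for o_row, x_row in zip(out, batch)]  # skip connection
    return relu(out)        # final activation after the addition


random.seed(0)
dim, batch_size = 4, 8  # illustrative sizes
w1 = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
w2 = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch_size)]

y = residual_block(x, w1, w2)
print(len(y), len(y[0]))
```

The block keeps the input shape, which is what allows the skip connection to be added element-wise. Other orderings exist (for example, normalizing and activating before each convolution), so treat this as one common arrangement rather than the only one.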
Also, if you are using a shallow network (fewer than about 6 layers), you can train your model without BatchNorm. You might see marginally slower convergence, but the forward pass of a network without BatchNorm is less time-consuming, so training without it may be the better trade-off. Experiments are needed to evaluate this trade-off, since the outcome often depends on your dataset and other problem-specific parameters.