From my understanding, batch normalization normalizes each channel across the batch dimension, whereas layer normalization normalizes across all channels for each individual sample. What confuses me is why layer normalization is preferred in language models while batch normalization is more common in CNNs. In my case, I want to extract correlated features between the channels of my dataset, and I am not sure which type of normalization I should use. My intuition is that layer normalization makes more sense, since it normalizes over the channels rather than over the batch, which is what I am trying to avoid (I do not want the model to assume that the sequences within a batch are somehow related).
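
To make the axis difference concrete, here is a minimal PyTorch sketch of my understanding (the tensor name and shapes are just illustrative, not from my actual dataset):

```python
import torch
import torch.nn as nn

N, C, L = 4, 3, 5                 # batch size, channels, sequence length
x = torch.randn(N, C, L)

# BatchNorm1d: statistics per channel, pooled over the batch and
# length dimensions -> samples in a batch influence each other.
bn = nn.BatchNorm1d(C)
x_bn = bn(x)
print(x_bn.mean(dim=(0, 2)))      # ~[0, 0, 0], one mean per channel

# LayerNorm: statistics per sample, pooled over the channel and
# length dimensions -> no interaction across samples in the batch.
ln = nn.LayerNorm([C, L])
x_ln = ln(x)
print(x_ln.mean(dim=(1, 2)))      # ~[0, 0, 0, 0], one mean per sample
```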