I have a question.
I wanted to understand exactly what batch normalization does. The documentation states that BatchNorm1d performs batch normalization as described in this paper.
However, I don't think that is quite accurate.
This is what BatchNorm1d does (using default parameters; I tested this and it reproduces the layer's output):
Input: Tensor X with shape (N,C,L)
mean_X = torch.mean(X, dim=(0, 2), keepdim=True)
var_X = torch.mean((X - mean_X) * (X - mean_X), dim=(0, 2), keepdim=True)
BN = (X - mean_X) / torch.sqrt(var_X + 1e-5)
So the mean and variance are computed over all input axes except C, which gives separate statistics for each channel. In the paper, however, they state that the mean and variance are computed over the batch, which I think corresponds to setting dim=0 above.
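To make the comparison concrete, here is a minimal sketch of how the two readings can be checked against the layer. It assumes a freshly constructed nn.BatchNorm1d (so training mode, weight 1, bias 0, eps=1e-5); the tensor sizes and the helper name normalize are made up for illustration and are not part of PyTorch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N, C, L = 4, 3, 5  # arbitrary sizes chosen for illustration
X = torch.randn(N, C, L)

# Freshly constructed layer: training mode, weight=1, bias=0, eps=1e-5.
bn = nn.BatchNorm1d(C)
reference = bn(X)


def normalize(X, dims, eps=1e-5):
    """Hypothetical helper: normalize X with the biased mean/variance over `dims`."""
    mean = X.mean(dim=dims, keepdim=True)
    var = ((X - mean) ** 2).mean(dim=dims, keepdim=True)
    return (X - mean) / torch.sqrt(var + eps)


per_channel = normalize(X, dims=(0, 2))  # statistics over batch and length axes
per_position = normalize(X, dims=(0,))   # statistics over the batch axis only

# Compare each variant with the layer's output.
matches_per_channel = torch.allclose(reference, per_channel, atol=1e-4)
matches_per_position = torch.allclose(reference, per_position, atol=1e-4)
print(matches_per_channel, matches_per_position)
```

If my reading is right, only the dim=(0, 2) variant should agree with the layer, while the dim=0 variant should give a different result.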
Can someone confirm or explain?