Hey everybody,

I have a question.

I wanted to understand what exactly is done within batch normalization. The documentation states that BatchNorm1d performs batch normalization as described in this paper.

I do, however, think that this is not true.

This is what BatchNorm1d does (using default parameters; I tested it and it produces the correct output):

Input: tensor X with shape (N, C, L)

mean_X = torch.mean(X, dim=(0, 2), keepdim=True)

var_X = torch.mean((X - mean_X) * (X - mean_X), dim=(0, 2), keepdim=True)

BN = (X - mean_X) / torch.sqrt(var_X + 1e-5)

Output: BN
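To make the comparison concrete, here is a self-contained version of the check I ran (the tensor shape and seed are just made-up values for illustration; I'm assuming the default BatchNorm1d settings, i.e. eps=1e-5 and the affine parameters at their initial values gamma=1, beta=0):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, L = 4, 3, 5          # arbitrary example sizes
X = torch.randn(N, C, L)

# Manual computation: per-channel mean and (biased) variance over N and L.
mean_X = torch.mean(X, dim=(0, 2), keepdim=True)
var_X = torch.mean((X - mean_X) ** 2, dim=(0, 2), keepdim=True)
BN_manual = (X - mean_X) / torch.sqrt(var_X + 1e-5)

# PyTorch's layer, freshly initialized, in training mode (the default).
bn = nn.BatchNorm1d(C)
BN_torch = bn(X)

print(torch.allclose(BN_manual, BN_torch, atol=1e-6))  # → True
```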

So the mean and variance are computed over all input axes except for C (which allows for different channels), i.e. there is one mean and one variance per channel, shared across the batch dimension N and the length dimension L. In the paper, however, they state that mean and variance are computed over the batch, which I think corresponds to setting dim=0 above.
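To see that normalizing over dim=0 alone is not what BatchNorm1d does for a 3-D input, here is a quick sketch (same made-up shapes as above; the per-position statistics are my own illustration, not anything from the docs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, L = 4, 3, 5
X = torch.randn(N, C, L)

# Statistics over the batch dimension only: one mean/variance
# per (channel, position) pair instead of one per channel.
mean_0 = torch.mean(X, dim=0, keepdim=True)
var_0 = torch.mean((X - mean_0) ** 2, dim=0, keepdim=True)
BN_dim0 = (X - mean_0) / torch.sqrt(var_0 + 1e-5)

BN_torch = nn.BatchNorm1d(C)(X)

# With random data these generally disagree, so BatchNorm1d
# is not normalizing over dim=0 alone.
print(torch.allclose(BN_dim0, BN_torch, atol=1e-6))
```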

Can someone confirm this or explain?