Over which dimension do we calculate the mean and std? Is it over the hidden dimensions of the NN layer, or over all the samples in the batch, separately for each hidden dimension?
In the paper it says we normalize over the batch.
In torch.nn.BatchNorm1d, however, the input argument is “num_features”. Why would we calculate the mean and std over the different features instead of the different samples?

You are correct that num_features corresponds to the “hidden dimension” rather than the batch size. However, if you think about this from the perspective of what statistics batchnorm needs to track, this makes sense. For example, for a hidden dimension of size 512, batchnorm needs to keep track of a mean and a variance for each of the 512 dimensions. Here, num_features is really just telling the module how much storage it needs to track its stats. Note that this size doesn’t depend on the batch size, because taking the mean reduces across the batch dimension.
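A quick way to check this is to compare the module's output against a manual normalization that reduces over the batch dimension only. This is just a minimal sketch: the sizes (32 and 512) and the variable names are made up for illustration, and it assumes a freshly constructed module in training mode, so the learnable scale and shift are still at their defaults of 1 and 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, hidden_dim = 32, 512  # illustrative sizes, not from the thread
x = torch.randn(batch_size, hidden_dim)

bn = nn.BatchNorm1d(num_features=hidden_dim)
bn.train()  # training mode: normalize with the statistics of the current batch
y = bn(x)

# One running mean and one scale per feature; neither depends on batch_size.
stats_shape = tuple(bn.running_mean.shape)
weight_shape = tuple(bn.weight.shape)
print(stats_shape, weight_shape)

# Manual version: reduce over the batch dimension (dim=0) only.
mean = x.mean(dim=0)                # shape (hidden_dim,)
var = x.var(dim=0, unbiased=False)  # biased variance is used for normalization
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

matches = torch.allclose(y, y_manual, atol=1e-4)
print(matches)
```

If the two outputs agree, the statistics are being taken across the samples in the batch, one pair (mean, variance) per feature, which is why the module only needs to know num_features up front.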

@eqy
Thanks for the answer. Assuming I have sequential data of the shape (bs, dim, seq_len), does BatchNorm1d calculate the mean and std of the batch separately for every timestep, or are the timesteps somehow merged as well?
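As far as I can tell from the BatchNorm1d docs, for 3D input of shape (N, C, L) the statistics are computed per channel over the (N, L) slices, so the timesteps are pooled together with the batch rather than handled separately. The sketch below checks that empirically. The sizes and variable names are made up for illustration, and it again assumes a fresh module in training mode with default scale and shift.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bs, dim, seq_len = 8, 16, 20  # illustrative sizes
x = torch.randn(bs, dim, seq_len)

bn = nn.BatchNorm1d(num_features=dim)
bn.train()  # training mode: normalize with the statistics of the current batch
y = bn(x)

# Hypothesis A: one mean/variance per channel, pooled over batch AND time.
mean = x.mean(dim=(0, 2), keepdim=True)                # shape (1, dim, 1)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)
y_pooled = (x - mean) / torch.sqrt(var + bn.eps)

# Hypothesis B: separate statistics for every timestep (batch only).
mean_t = x.mean(dim=0, keepdim=True)                   # shape (1, dim, seq_len)
var_t = x.var(dim=0, unbiased=False, keepdim=True)
y_per_step = (x - mean_t) / torch.sqrt(var_t + bn.eps)

pooled_matches = torch.allclose(y, y_pooled, atol=1e-4)
per_step_matches = torch.allclose(y, y_per_step, atol=1e-4)
print(pooled_matches, per_step_matches)
```

If only the pooled version matches, each of the dim channels gets a single mean and variance shared across all timesteps. Per-timestep statistics would need something else, for example reshaping so that the time axis is folded into the feature axis before the norm.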