The nn.Linear layer transforms shapes of the form (N, *, in_features) -> (N, *, out_features). I have found myself trying multiple times to apply batch normalization after a linear layer. However, because the default nn.BatchNormNd layers only normalize over dimension 1 (which corresponds to the channel dimension for convolutional layers), I can only compose nn.BatchNormNd directly with nn.Linear when the input has no additional dimensions. If there are more dimensions, I have to transpose the last dimension into the channel dimension before the batch normalization and transpose it back afterwards. Is there a better way to do this?
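For concreteness, the transpose workaround I mean looks roughly like this (a minimal sketch in current PyTorch; all the sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

N, L, F = 4, 7, 16          # hypothetical batch, extra dim, and feature sizes
linear = nn.Linear(32, F)
bn = nn.BatchNorm1d(F)      # num_features must match the linear layer's last dim

x = torch.randn(N, L, 32)
h = linear(x)               # shape [N, L, F]

# BatchNorm1d normalizes over dimension 1, so move the feature dim there
# for the normalization and move it back afterwards.
out = bn(h.transpose(1, 2)).transpose(1, 2)   # [N, F, L] -> BN -> [N, L, F]
print(out.shape)            # torch.Size([4, 7, 16])
```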
It’s interesting that you want to normalize each slice along the last dimension, especially considering that it comes from an nn.Linear output. May I ask why?
Upon some further investigation I realized my conception of batch normalization was incorrect, and that it should be applied sample-wise following dense layers. However, PyTorch does not help this confusion by naming the constructor parameter num_features, which led me to believe batch normalization is applied feature-wise for both convolutional and dense layers.
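To illustrate the naming: num_features is always the size of dimension 1, whatever that dimension happens to represent. A small sketch (sizes made up for illustration):

```python
import torch
import torch.nn as nn

# The same BatchNorm1d(8) accepts both dense-style [N, features] inputs
# and conv-style [N, channels, length] inputs, because num_features
# refers to dimension 1 in either case.
bn = nn.BatchNorm1d(8)

dense_out = torch.randn(32, 8)       # as if from a linear layer
conv_out = torch.randn(32, 8, 100)   # as if from a 1d conv layer

print(bn(dense_out).shape)   # torch.Size([32, 8])
print(bn(conv_out).shape)    # torch.Size([32, 8, 100])
```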
Yes, the name is definitely a bit confusing.
Are you saying that you want to normalize over all but the batch dimension, i.e., each sample is normalized with the mean and std computed from that sample? If so, you should use Layer Normalization, which will become available in the next release. To achieve the same effect in 0.3.1, you can feed input.unsqueeze(0) into a BN layer.
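In later releases this is nn.LayerNorm, which normalizes each sample over its trailing dims with no dependence on the rest of the batch. A minimal sketch (sizes made up for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)
ln = nn.LayerNorm(16)    # normalized_shape = the trailing feature dim
y = ln(x)

# Each row is normalized with its own mean and std, so after
# normalization every row's mean is ~0.
print(y[0].mean())
```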
Jeez, this is what I get for jumping in and trying to implement white-paper models without putting in the time to fully understand the requisite components. I think what you suggested is correct, but to clarify: is it the case that statistics are computed along slices spanning both the channel and mini-batch dimensions for convolutional layers, but only along slices of the mini-batch dimension for dense layers?
Here is what BN does:
- It assumes the input has shape [B x C x *], where:
  - B is the minibatch size,
  - C is the size of the second dimension (in image data it is the channel dim, but generally it is just the second dim),
  - * is an arbitrary number of dimensions of arbitrary sizes.
- It normalizes each slice input[:, i, :] for i = 0, ..., C - 1, i.e., each slice is normalized using the mean and std computed from that slice.
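This per-slice behavior can be checked directly. With affine=False the layer is pure normalization, so in training mode each channel slice should come out with roughly zero mean and unit variance (sizes below are arbitrary):

```python
import torch
import torch.nn as nn

B, C = 8, 3
bn = nn.BatchNorm1d(C, affine=False)   # no learnable scale/shift
x = torch.randn(B, C, 10)
y = bn(x)                              # training mode: batch statistics

for i in range(C):
    # y[:, i] was normalized using the mean/std of the slice x[:, i]
    print(y[:, i].mean().item(), y[:, i].var(unbiased=False).item())
```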
So if you have input of shape [B, *] and do bn_layer(input.unsqueeze(0)), you are effectively doing:
1. input_new = input.unsqueeze(0), which has shape [1, B, *].
2. output = bn(input_new), which normalizes each slice input_new[:, i] = input[i] for i = 0, ..., B - 1.
3. output = output.squeeze(0), which gives what you want.
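The trick can be verified against the same per-sample normalization written out by hand. Note that num_features must equal the batch size B here, and that BN normalizes with the biased variance plus a small eps (sizes below are arbitrary):

```python
import torch
import torch.nn as nn

B, F = 6, 20
x = torch.randn(B, F)

# unsqueeze(0) turns the batch dim into BN's channel dim, so each
# sample is normalized with its own statistics.
bn = nn.BatchNorm1d(B, affine=False)
out = bn(x.unsqueeze(0)).squeeze(0)   # [B, F] -> [1, B, F] -> BN -> [B, F]

# The equivalent per-sample normalization done manually:
mean = x.mean(dim=1, keepdim=True)
var = x.var(dim=1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-5))  # True
```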