Is it possible to perform batch normalization in a network that is only linear layers?
For example:
```python
import torch.nn as nn
import torch.nn.functional as F

class network(nn.Module):
    def __init__(self):
        super(network, self).__init__()
        self.linear1 = nn.Linear(in_features=40, out_features=320)
        self.linear2 = nn.Linear(in_features=320, out_features=2)

    def forward(self, input):  # input is a 1D tensor
        y = F.relu(self.linear1(input))
        # Would it be possible to do a batch normalization of y over here? If so, how?
        y = F.softmax(self.linear2(y), dim=-1)
        return y
```
When I do that, I get a different error: “ValueError: Expected more than 1 value per channel when training, got input size [1, 320].” This is for a Q-network, so it only receives one state at a time, hence the batch size of 1.
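For reference, here is a minimal sketch of how an nn.BatchNorm1d layer could be slotted between the linear layers (the bn1 name and the example sizes are my own). Note that in train() mode BatchNorm needs more than one sample per batch to compute statistics, which is exactly what the ValueError above complains about; a single-state Q-network would have to call eval() (or batch its states) before passing a [1, 320] activation through the layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(in_features=40, out_features=320)
        self.bn1 = nn.BatchNorm1d(num_features=320)  # normalizes over the batch dimension
        self.linear2 = nn.Linear(in_features=320, out_features=2)

    def forward(self, input):  # input: [batch_size, 40]
        y = F.relu(self.bn1(self.linear1(input)))
        return F.softmax(self.linear2(y), dim=1)

net = Network()
net.eval()  # use running statistics, so a batch of size 1 works
with torch.no_grad():
    out = net(torch.randn(1, 40))  # no ValueError in eval mode
```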
You will most likely see different performance depending on where you place the batchnorm layer, since the input activation will have a different distribution.
@shirui-japina In general, the BatchNorm layer is usually added before the ReLU (as mentioned in the Batch Normalization paper). But there is no real standard for where to add a BatchNorm layer; you can experiment with different settings, and you may find different performance for each one.
As far as I know, you will generally find batch norm as part of the feature-extraction branch of a network and not in its classification branch (nn.Linear).
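To make the placement options concrete, here is a small sketch with made-up layer sizes; the Batch Normalization paper puts BN between the affine transform and the non-linearity, while the post-activation variant is also seen in practice:

```python
import torch.nn as nn

# BN between the affine transform and the non-linearity, as in the original paper
pre_activation = nn.Sequential(
    nn.Linear(40, 320),
    nn.BatchNorm1d(320),
    nn.ReLU(),
    nn.Linear(320, 2),
)

# BN after the activation, an alternative you may also come across
post_activation = nn.Sequential(
    nn.Linear(40, 320),
    nn.ReLU(),
    nn.BatchNorm1d(320),
    nn.Linear(320, 2),
)
```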
Sorry to take your time. I have a question: I normalized my patches before training, and my ANN is 2 CNN layers with 2 fully connected layers. Is it necessary to do batch normalization, or is it unnecessary since the network is not very deep?
Oh, my best advice is to try out both approaches and compare the validation accuracy with and without batchnorm layers.
I don’t have a specific advice on when to use them with respect to the number of layers.
Let us know which model worked better!
PS: Also, compare the training and validation accuracy to pick the right model, not the test accuracy, as you would leak the test data information into your model selection process.
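If it helps, here is a rough sketch of how the two variants could be set up for a 2-conv + 2-FC model; the channel counts, the 10 classes, and the 1×28×28 input are assumptions of mine, not something from the posts above:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """2 conv layers + 2 fully connected layers, with optional BatchNorm."""
    def __init__(self, use_bn: bool = True):
        super().__init__()
        def block(in_ch, out_ch):
            layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)]
            if use_bn:
                layers.append(nn.BatchNorm2d(out_ch))
            layers += [nn.ReLU(), nn.MaxPool2d(2)]
            return layers
        self.features = nn.Sequential(*block(1, 16), *block(16, 32))  # assumed 1x28x28 input
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10),  # assumed 10 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# train both variants and compare their *validation* accuracy, as suggested above
with_bn, without_bn = SmallCNN(use_bn=True), SmallCNN(use_bn=False)
print(with_bn(torch.randn(4, 1, 28, 28)).shape)  # -> torch.Size([4, 10])
```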
You most likely will not see a drastic change in network performance (higher accuracy, etc.). However, batchnorm incurs around 30% overhead on your network's runtime, and it affects training as well as inference, unless you fuse the layers at inference time.
All in all, BatchNorm shines when you have a very deep architecture; what you have there is not really considered that deep.
You may very well update us with the results you get, though.
Cheers.
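On the fusion point: here is a sketch (with assumed layer names) of what folding a BatchNorm1d into the preceding Linear layer amounts to at inference time, using the layer's running statistics. PyTorch also ships fusion utilities for common conv/bn patterns, but the manual version shows the idea:

```python
import torch
import torch.nn as nn

def fold_bn_into_linear(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    """Build a single Linear layer equivalent to `linear` followed by `bn` in eval mode."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(var + eps)
    fused = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        fused.weight.copy_(linear.weight * scale[:, None])
        fused.bias.copy_((linear.bias - bn.running_mean) * scale + bn.bias)
    return fused

# quick sanity check with some running statistics accumulated in train mode
linear, bn = nn.Linear(40, 320), nn.BatchNorm1d(320)
_ = bn(linear(torch.randn(64, 40)))  # updates bn.running_mean / bn.running_var
bn.eval()

x = torch.randn(8, 40)
fused = fold_bn_into_linear(linear, bn)
print(torch.allclose(bn(linear(x)), fused(x), atol=1e-5))  # -> True
```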
BatchNorm was introduced to distribute the data uniformly around a mean that the network sees best, before it is squashed by the activation function. Without BN, the activations could over- or undershoot, depending on the squashing function.
Hence, in practice, BN before the activation function generally gives better performance.
For the sake of argument, one could put dropout as the very first layer, or even mix it in with the conv layers, and the network would still train. But that doesn't make much sense.
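To illustrate the over/undershoot point, here is a toy experiment (the depth, width, and Tanh squashing function are arbitrary choices of mine): with the default init, the activation statistics shrink toward zero through a deep stack of linear layers without normalization, while BatchNorm keeps them in a healthy range:

```python
import torch
import torch.nn as nn

def stack(depth=20, width=256, use_bn=False):
    # depth x (Linear -> [BatchNorm1d] -> Tanh), sizes chosen arbitrarily
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if use_bn:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)

x = torch.randn(512, 256)
with torch.no_grad():
    for use_bn in (False, True):
        out = stack(use_bn=use_bn)(x)
        print(f"use_bn={use_bn}: mean={out.mean():+.4f}, std={out.std():.4f}")
```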