Batch Normalization of Linear Layers

Is it possible to perform batch normalization in a network that is only linear layers?

For example:

class network(nn.Module):
    def __init__(self):
        super(network, self).__init__()
        self.linear1 = nn.Linear(in_features=40, out_features=320)
        self.linear2 = nn.Linear(in_features=320, out_features=2)

    def forward(input):  # Input is a 1D tensor
        y = F.relu(self.linear1(input))
        # Would it be possible to do a batch normalization of y overhere? If so how?
        y = F.softmax(self.linear2(input))
        return y

Sure! You could just use nn.BatchNorm1d.
There are some minor issues in your code, so here is a working example:

import torch
import torch.nn as nn
import torch.nn.functional as F

class network(nn.Module):
    def __init__(self):
        super(network, self).__init__()
        self.linear1 = nn.Linear(in_features=40, out_features=320)
        self.bn1 = nn.BatchNorm1d(num_features=320)
        self.linear2 = nn.Linear(in_features=320, out_features=2)

    def forward(self, input):  # Input is a 2D tensor: [batch_size, num_features]
        y = F.relu(self.bn1(self.linear1(input)))
        y = F.softmax(self.linear2(y), dim=1)
        return y

model = network()
x = torch.randn(10, 40)
output = model(x)

You can also put the BatchNorm after the relu, if you like.


@ptrblck I tried that but I received “ValueError: expected a 2D or 3D input (got 1D input).”

Are you sure you are passing your input as [batch_dim, num_features]?
The error sounds like you’ve passed just [num_features] to your model.
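As a minimal sketch (the feature size of 40 is just an example): a single 1D sample raises that error, while stacking samples into a [batch_dim, num_features] tensor works:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=40)

# single = torch.randn(40)  # shape [40] -> "expected 2D or 3D input (got 1D input)"
batch = torch.stack([torch.randn(40) for _ in range(8)])  # shape [8, 40]
out = bn(batch)
print(out.shape)  # torch.Size([8, 40])
```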

When I do that I get a different error: “ValueError: Expected more than 1 value per channel when training, got input size [1, 320].” This is for a Q-network, so it only receives one state at a time, hence the batch size of 1.

Then nn.BatchNorm1d probably won’t work very well, since it cannot compute batch statistics from a single sample.
Have a look at the other normalization layers. Maybe LayerNorm or another one will fit your needs.
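For instance, nn.LayerNorm normalizes over the feature dimension of each sample independently, so it works even with a batch size of 1 (a minimal sketch, using the 320-feature hidden layer from above):

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=320)

state = torch.randn(1, 320)  # a single state, batch size of 1
out = ln(state)              # no "more than 1 value per channel" error
print(out.shape)             # torch.Size([1, 320])
```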

Does it have the same effect to put the BatchNorm before or after the ReLU?:thinking::thinking:

You will most likely see a different performance depending on where you place the batchnorm layer, since the input activation will have a different distribution.

So… where should I place the BatchNorm layer to train a model with great performance?
(Not only in models with linear layers, but also in CNNs or RNNs):flushed::flushed:

  1. Between every pair of layers?:thinking:

  2. Just before or just after the activation function layer?:thinking:

  3. Should it go before or after the activation function layer?:thinking:

And where shouldn’t I place the BatchNorm layer?

@shirui-japina In general, the Batch Norm layer is usually added before the ReLU (as mentioned in the Batch Normalization paper). But there is no real standard for where to add a Batch Norm layer. You can experiment with different settings, and you may find different performance for each one.

As far as I know, you will generally find batch norm in the feature extraction branch of a network and not in its classification branch (nn.Linear).
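A minimal sketch of that convention (layer sizes are arbitrary examples): BatchNorm sits in the convolutional feature extractor, placed before the ReLU, while the linear classification head has no norm layer:

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),  # BN before the activation, as in the paper
    nn.ReLU(),
)
classifier = nn.Linear(16 * 8 * 8, 10)  # plain linear head, no BatchNorm

x = torch.randn(4, 3, 8, 8)
y = classifier(features(x).flatten(1))
print(y.shape)  # torch.Size([4, 10])
```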


Thanks for your reply.:smiley:

So the placement of the BatchNorm layer in a CNN is like this:

[image: Conv → BatchNorm → ReLU ordering]
How about the pooling layer?
Should we place BatchNorm layer before the pooling layer?:thinking::thinking:


If you ask me, I would place it after the pooling layer. But you can check out how the vision models are implemented in PyTorch (torchvision) to get clarity.
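For reference, torchvision’s ResNet stem orders the layers conv → bn → relu → maxpool, i.e. the pooling comes last, after the BatchNorm and ReLU. A minimal sketch of that stem (sizes match ResNet-18/50):

```python
import torch
import torch.nn as nn

# ResNet-style stem: Conv -> BatchNorm -> ReLU -> MaxPool
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 64, 56, 56])
```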


Got it, thanks for your help.:smiley::smiley:


Hi Ptrblck

Sorry to take your time. I have a question: I normalized my patches before training, and my network is 2 CNN layers with 2 fully connected layers. Is it necessary to do batch normalization, or is it unnecessary since the layers are not very deep?

Oh, my best advice is to try out both approaches and compare the validation accuracy with and without batchnorm layers.
I don’t have a specific advice on when to use them with respect to the number of layers. :confused:

Let us know, which model worked better! :slight_smile:

PS: Also, compare the training and validation accuracy to pick the right model, not the test accuracy, as you would leak the test data information into your model selection process.


You most likely will not see a drastic change in the network performance (e.g. higher accuracy). However, batchnorm incurs roughly a 30% overhead on your network runtime; it will affect training as well as inference, unless you fuse the batchnorm layers into the preceding layers at inference time.
All in all, BatchNorm shines when you have a very deep architecture, and what you have there is not really considered that deep.
You may very well update us with the result you get, though.


Thanks for your answer

Shouldn’t we set the bias to False for the linear and conv layers when using batch norm?
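(For context, a minimal sketch of why the bias becomes redundant: BatchNorm subtracts the per-channel batch mean, so a constant bias added by the preceding layer is removed again, and BN’s own shift parameter β takes over its role. The layer sizes below are arbitrary examples.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(16)

x = torch.randn(4, 3, 8, 8)
with torch.no_grad():
    out_with_bias = bn(conv(x))
    conv.bias.zero_()  # remove the bias entirely
    out_without_bias = bn(conv(x))

# BN's mean subtraction cancels the constant bias, so both outputs match
print(torch.allclose(out_with_bias, out_without_bias, atol=1e-5))  # True
```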

BatchNorm was introduced to keep the activations distributed around a mean that the network sees best, before squashing them with the activation function. Without BN, the activations could over- or undershoot, depending on the squashing function.

Hence, in practice as well, BN before the activation function tends to give better performance.

I mean, for the sake of argument, one could put a dropout as the very first layer, or even pair it with Conv layers, and the network would still train. But that doesn’t make any sense.


Hi, if my input is [batch_dim, temporal_length, channel_dim], how do I do batch normalization?
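nn.BatchNorm1d expects the channel dimension second for 3D inputs, i.e. [batch, channel, temporal], so one common approach is to transpose before and after the norm (a minimal sketch; the sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=64)  # num_features = channel_dim

x = torch.randn(8, 100, 64)                  # [batch, temporal, channel]
y = bn(x.transpose(1, 2)).transpose(1, 2)    # normalize per channel, restore shape
print(y.shape)                               # torch.Size([8, 100, 64])
```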