Is it possible to perform batch normalization in a network that is only linear layers?
For example:
```python
import torch.nn as nn
import torch.nn.functional as F

class network(nn.Module):
    def __init__(self):
        super(network, self).__init__()
        self.linear1 = nn.Linear(in_features=40, out_features=320)
        self.linear2 = nn.Linear(in_features=320, out_features=2)

    def forward(self, input):  # input is a 1D tensor
        y = F.relu(self.linear1(input))
        # Would it be possible to do a batch normalization of y over here? If so, how?
        y = F.softmax(self.linear2(y), dim=-1)
        return y
```
When I do that, I get a different error: “ValueError: Expected more than 1 value per channel when training, got input size [1, 320].” This is for a Q-network, so it only receives one state at a time, hence the batch size of 1.
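For reference, here is a minimal sketch of how an nn.BatchNorm1d layer could be slotted between the linear layers (the bn1 name and the example sizes are my own). Note that in train() mode BatchNorm needs more than one sample per batch to compute statistics, which is exactly what the ValueError above complains about; a single-state Q-network would have to call eval() (or batch its states) before passing a [1, 320] activation through the layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(in_features=40, out_features=320)
        self.bn1 = nn.BatchNorm1d(num_features=320)  # normalizes over the batch dimension
        self.linear2 = nn.Linear(in_features=320, out_features=2)

    def forward(self, input):  # input: [batch_size, 40]
        y = F.relu(self.bn1(self.linear1(input)))
        return F.softmax(self.linear2(y), dim=1)

net = Network()
net.eval()  # use running statistics, so a batch of size 1 works
with torch.no_grad():
    out = net(torch.randn(1, 40))  # no ValueError in eval mode
```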
You will most likely see different performance depending on where you place the batchnorm layer, since the input activation will have a different distribution.
@shirui-japina In general, the BatchNorm layer is usually added before the ReLU (as mentioned in the Batch Normalization paper). But there is no real standard for where to add a BatchNorm layer; you can experiment with different settings, and you may find different performance for each one.
As far as I know, you will generally find batch norm as part of the feature-extraction branch of a network and not in its classification branch (nn.Linear).
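To make the placement options concrete, here is a small sketch with made-up layer sizes; the Batch Normalization paper puts BN between the affine transform and the non-linearity, while the post-activation variant is also seen in practice:

```python
import torch.nn as nn

# BN between the affine transform and the non-linearity, as in the original paper
pre_activation = nn.Sequential(
    nn.Linear(40, 320),
    nn.BatchNorm1d(320),
    nn.ReLU(),
    nn.Linear(320, 2),
)

# BN after the activation, an alternative you may also come across
post_activation = nn.Sequential(
    nn.Linear(40, 320),
    nn.ReLU(),
    nn.BatchNorm1d(320),
    nn.Linear(320, 2),
)
```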
Sorry to take your time. I have a question: I normalized my patches before training, and my ANN is 2 CNN layers with 2 fully connected layers. Is it necessary to do batch normalization, or is it unnecessary since the network is not very deep?
Oh, my best advice is to try out both approaches and compare the validation accuracy with and without batchnorm layers.
I don’t have a specific advice on when to use them with respect to the number of layers.
Let us know which model worked better!
PS: Also, compare the training and validation accuracy to pick the right model, not the test accuracy, as you would leak the test data information into your model selection process.
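If it helps, here is a rough sketch of how the two variants could be set up for a 2-conv + 2-FC model; the channel counts, the 10 classes, and the 1×28×28 input are assumptions of mine, not something from the posts above:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """2 conv layers + 2 fully connected layers, with optional BatchNorm."""
    def __init__(self, use_bn: bool = True):
        super().__init__()
        def block(in_ch, out_ch):
            layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)]
            if use_bn:
                layers.append(nn.BatchNorm2d(out_ch))
            layers += [nn.ReLU(), nn.MaxPool2d(2)]
            return layers
        self.features = nn.Sequential(*block(1, 16), *block(16, 32))  # assumed 1x28x28 input
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10),  # assumed 10 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# train both variants and compare their *validation* accuracy, as suggested above
with_bn, without_bn = SmallCNN(use_bn=True), SmallCNN(use_bn=False)
print(with_bn(torch.randn(4, 1, 28, 28)).shape)  # -> torch.Size([4, 10])
```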
You most likely will not see a drastic change in network performance (higher accuracy, etc.). However, batchnorm incurs around 30% overhead on your network's runtime, and it affects training as well as inference, unless you fuse the layers at inference time.
All in all, BatchNorm shines when you have a very deep architecture; what you have there is not really considered that deep.
You may very well update us with the results you get, though.
Cheers.
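On the fusion point: here is a sketch (with assumed layer names) of what folding a BatchNorm1d into the preceding Linear layer amounts to at inference time, using the layer's running statistics. PyTorch also ships fusion utilities for common conv/bn patterns, but the manual version shows the idea:

```python
import torch
import torch.nn as nn

def fold_bn_into_linear(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    """Build a single Linear layer equivalent to `linear` followed by `bn` in eval mode."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(var + eps)
    fused = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        fused.weight.copy_(linear.weight * scale[:, None])
        fused.bias.copy_((linear.bias - bn.running_mean) * scale + bn.bias)
    return fused

# quick sanity check with some running statistics accumulated in train mode
linear, bn = nn.Linear(40, 320), nn.BatchNorm1d(320)
_ = bn(linear(torch.randn(64, 40)))  # updates bn.running_mean / bn.running_var
bn.eval()

x = torch.randn(8, 40)
fused = fold_bn_into_linear(linear, bn)
print(torch.allclose(bn(linear(x)), fused(x), atol=1e-5))  # -> True
```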
BatchNorm was introduced to distribute the data uniformly around a mean that the network sees best, before it is squashed by the activation function. Without BN, the activations could over- or undershoot, depending on the squashing function.
Hence, in practice, BN before the activation function generally gives better performance.
For the sake of argument, one could put dropout as the very first layer, or even mix it in with the conv layers, and the network would still train. But that doesn't make much sense.
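To illustrate the over/undershoot point, here is a toy experiment (the depth, width, and Tanh squashing function are arbitrary choices of mine): with the default init, the activation statistics shrink toward zero through a deep stack of linear layers without normalization, while BatchNorm keeps them in a healthy range:

```python
import torch
import torch.nn as nn

def stack(depth=20, width=256, use_bn=False):
    # depth x (Linear -> [BatchNorm1d] -> Tanh), sizes chosen arbitrarily
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if use_bn:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)

x = torch.randn(512, 256)
with torch.no_grad():
    for use_bn in (False, True):
        out = stack(use_bn=use_bn)(x)
        print(f"use_bn={use_bn}: mean={out.mean():+.4f}, std={out.std():.4f}")
```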