BatchNorm1d - input shape

As far as I understand the documentation, the BatchNorm1d layer takes the number of features as its constructor argument (nn.BatchNorm1d(num_features)).

As input the layer takes (N, C, L), where N is the batch size (I guess…), C is the number of features (the dimension over which normalization is computed), and L is the sequence length.

Let’s assume I have input of the following shape:
(batch_size, number_of_timesteps, number_of_features)
which is the usual data shape for a time series when batch_first=True.

Should I transpose the input (swap dimension 1 and 2) before running the batch normalization?

In this case I will have to transpose the output again before feeding it to the RNN later. That looks quite awkward to me.

Can someone please take a look at the example below and let me know whether this is the proper way?


import torch
from torch import nn

# data (batch size, number of time steps, number of features)
x = torch.rand(3, 4, 5)
# layers
bn = nn.BatchNorm1d(5)
rnn = nn.RNN(5, 10, 1, batch_first=True)
# computation - transpose TWICE
x_normalized = bn(x.transpose(1, 2)).transpose(1, 2)
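As a quick sanity check (not part of the original snippet) that the double transpose normalizes the feature dimension: in training mode, each feature should come out with mean ≈ 0 and std ≈ 1 across the batch and time dimensions.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(3, 4, 5)    # (batch, time steps, features)
bn = nn.BatchNorm1d(5)     # training mode by default

# move features to dim 1, normalize, move back
y = bn(x.transpose(1, 2)).transpose(1, 2)

# each feature is normalized across batch and time
print(y.mean(dim=(0, 1)))                 # ~ 0 per feature
print(y.std(dim=(0, 1), unbiased=False))  # ~ 1 per feature
```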

Your code looks correct: the batchnorm layer expects an input of [batch_size, features, temp. dim], so you would need to permute it before (and after, to match the expected input of the RNN).
In your code snippet you could of course initialize the tensor in the right shape, but I assume that code is just to show the usage.


Isn’t this a fundamentally flawed approach? BatchNorm1d does not work harmoniously with nn.Linear, which is one of the most fundamental parts of PyTorch. This makes it impossible to use BatchNorm and Linear layers together in an nn.Sequential, at least to my knowledge.

It depends on what you are trying to achieve, and I wouldn’t claim the approach is fundamentally flawed, as I’m not familiar with @adm’s use case.

You can use nn.BatchNorm1d layers in an nn.Sequential container as seen here:

model = nn.Sequential(
    nn.Linear(4, 3),
    nn.BatchNorm1d(3),
)

x = torch.randn(2, 4)
out = model(x)

If you add another dimension and want to permute the activation, you could write a custom module, which permutes it to the right shape.
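A minimal sketch of such a module (the name `Permute` is made up here; any parameter-free permutation module would do):

```python
import torch
from torch import nn

class Permute(nn.Module):
    """Parameter-free module that permutes the input dimensions."""
    def __init__(self, *dims):
        super().__init__()
        self.dims = dims

    def forward(self, x):
        return x.permute(*self.dims)

# (N, L, C) input: move features to dim 1 for BatchNorm1d, then back
model = nn.Sequential(
    nn.Linear(4, 3),
    Permute(0, 2, 1),
    nn.BatchNorm1d(3),
    Permute(0, 2, 1),
)

x = torch.randn(2, 7, 4)  # (batch, seq_len, features)
out = model(x)
print(out.shape)  # torch.Size([2, 7, 3])
```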

For 2D batches it works great. However, when we use fully connected layers on 3D inputs of shape (N=batch size, L=sequence length, C=input size), we have to transpose twice to use BatchNorm1d after each linear transformation: each linear layer acts on dim=-1 (our features), while BatchNorm can only act on dim=1, so a transposition is needed both before and after the BatchNorm. For example, a fully connected network of 3 layers is given below, with BatchNorm before each non-linearity.

import torch
from torch import nn

class FC_layer(nn.Module):
    def __init__(self, input_size_FC1, output_size_FC1, output_size_FC2, output_size_FC3):
        super(FC_layer, self).__init__()
        self.linear_layer1 = nn.Linear(input_size_FC1, output_size_FC1)
        self.normalization1 = nn.BatchNorm1d(output_size_FC1)
        self.linear_layer2 = nn.Linear(output_size_FC1, output_size_FC2)
        self.normalization2 = nn.BatchNorm1d(output_size_FC2)
        self.linear_layer3 = nn.Linear(output_size_FC2, output_size_FC3)
        self.normalization3 = nn.BatchNorm1d(output_size_FC3)

    def forward(self, x):
        # x: tensor of shape (N_batch, N_sequence, N_features)
        f = nn.ReLU()  # Activation functions
        g = nn.LogSoftmax(dim=2)
        x_lin1 = self.linear_layer1(x)  # Apply linear transformation
        x_lin1 = torch.transpose(x_lin1, 1, 2)  # Transpose for BatchNorm1d
        x_lin1_norm = self.normalization1(x_lin1)  # Normalize
        layer1_out = f(x_lin1_norm)  # Apply non-linearity
        layer2_in = torch.transpose(layer1_out, 1, 2)  # Transpose for next linear transformation
        x_lin2 = self.linear_layer2(layer2_in)
        x_lin2 = torch.transpose(x_lin2, 1, 2)
        x_lin2_norm = self.normalization2(x_lin2)
        layer2_out = f(x_lin2_norm)
        layer3_in = torch.transpose(layer2_out, 1, 2)  # Transpose for next linear transformation
        x_lin3 = self.linear_layer3(layer3_in)
        x_lin3 = torch.transpose(x_lin3, 1, 2)
        x_lin3_norm = self.normalization3(x_lin3)
        layer3_out = g(x_lin3_norm)
        output = torch.transpose(layer3_out, 1, 2)
        return output

Instead, if BatchNorm1d could act on the final dimension, we wouldn’t need to worry about these dimension conversions.
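One way to get that behavior today is a small wrapper module that hides the two transposes (a sketch; the name `BatchNorm1dLast` is made up here):

```python
import torch
from torch import nn

class BatchNorm1dLast(nn.Module):
    """Apply nn.BatchNorm1d over the last dimension of a (N, L, C) input."""
    def __init__(self, num_features):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x):
        # (N, L, C) -> (N, C, L), normalize over C, then back to (N, L, C)
        return self.bn(x.transpose(1, 2)).transpose(1, 2)

layer = BatchNorm1dLast(5)
out = layer(torch.rand(3, 4, 5))  # (batch, seq_len, features)
print(out.shape)  # torch.Size([3, 4, 5])
```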

I couldn’t find a way to do this in the sequential container, but you are suggesting that I could write a permutation module with no parameters and use it in the sequential container, right? I am worried about the need to copy() or replace the tensor with its transposed version, and also about the contiguity requirements of other PyTorch functionality.

Is there currently any way to avoid the permutation before and after BatchNorm1d in a sequential container? I just need to apply it before an LSTM layer, and I am not able to do the suggested permutation.