Batch Normalization of Linear Layers

If I reshape the input to [batch_num * temporal_length, channel_dim] and then apply nn.BatchNorm1d(num_features=channel_dim), will this work correctly?

I would permute the input to the shape [batch_size, channels, seq_len] and apply the batchnorm layer.
This would normalize the values over the batch and temporal dimensions using per-channel stats.
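
A minimal sketch, assuming the input comes in as [batch_size, seq_len, channels] (e.g. the output of a linear or recurrent layer):

import torch
import torch.nn as nn

batch_size, seq_len, channels = 2, 4, 3
x = torch.randn(batch_size, seq_len, channels)

bn = nn.BatchNorm1d(channels)
out = bn(x.permute(0, 2, 1))  # nn.BatchNorm1d expects [batch_size, channels, seq_len]
out = out.permute(0, 2, 1)    # back to [batch_size, seq_len, channels]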

Yeah, but in this way the feature is normalized within a seq_len, not a batch?

I’m not sure what “within a seq_len not a batch” means, but this code snippet shows how the normalization is applied internally:

import torch
import torch.nn as nn

N, C, L = 2, 3, 4
x = torch.randn(N, C, L) * 10 + 5
bn = nn.BatchNorm1d(C)
out = bn(x)

# manual normalization: per-channel stats over the batch and temporal dims
out_manual = (x - x.mean([0, 2], keepdim=True)) / x.std([0, 2], unbiased=False, keepdim=True)
print(torch.allclose(out_manual, out))
> True
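
For reference, the reshape from your first post should give the same result, since flattening the batch and temporal dimensions normalizes each channel over the same set of values (a small check reusing x, bn, and out from above):

out_reshaped = bn(x.permute(0, 2, 1).reshape(N * L, C))  # [batch * seq_len, channels]
print(torch.allclose(out_reshaped, out.permute(0, 2, 1).reshape(N * L, C)))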

Hi,
Is there any way to use a functional call like F.relu inside an nn.Sequential? I am running into this problem while implementing VGG. The official implementation uses:

if batch_norm:
    layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
else:
    layers += [conv2d, nn.ReLU(inplace=True)]

but I don't want to use nn.ReLU; instead I want to use F.relu. How can I do that?
Would this work, or is there another way to do it?


if batchnorm:
    layers+= [conv2d, F.relu(nn.BatchNorm2d(v), inplace=True)]
else:
    layers += [F.relu(conv2d, inplace=True)]

Thanks for the idea.

No, this will most likely not work, as an nn.Module would be needed.
You could wrap the functional call into a custom module or just use nn.ReLU() (which would be the same).
Why don’t you want to use nn.ReLU() directly?

I am passing those activation functions in a tuple (F.relu, F.selu, …) as a function argument so that they can be used in the conv layers.
The problem with nn module classes is that custom-defined activation classes would be messy to write; instead, another person could just pass a function name as an argument, which will be helpful in the future when experimenting with new custom functions.
So please tell me how I can solve this.

I don’t understand the use case. How would you like to pass e.g. F.relu as an argument to a conv layer?

I had used it before in other architectures. Here is a snippet:

class LeNet(nn.Module):
    ....
    def forward(self, x, activations=None):
        """
        Parameters:
          x: input tensor
          activations: set of 5 activation functions, one for each conv and linear layer.
        """
        act1 = act2 = act3 = act4 = act5 = F.relu

        if activations is not None:
            (act1, act2, act3, act4, act5) = activations

        x = act1(self.conv1(x))
        x = self.maxpool1(x) if self.pool1 == 'max' else self.avgpool1(x)
        ....

and I had called the train function, which takes LeNet, as:

model, train_acc = train(train_loader, LeNet, epochs=10, lr=rate, use_cuda=True, pools=pool, activations=activation)

in which the train function takes the model and the activations and applies them to the model for the given combination of activations:

def train(trainloader, model, epochs, activations):
    model = model(1, pools)
    pred = model(image, activations)

But the problem I am facing is how to use a similar technique in the make_layers function, which builds an nn.Sequential, so that it can be used for any model of the VGG family.
As in:

def make_layers(...., activations=(act1, act2, ....)):
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]

And the activations argument could then be used in place of nn.ReLU or any custom-made activation function.
I wanted to use the functional API because other activations worked when passed as functionals but not as nn activation modules. Although nn.ReLU can be passed in this case, some other activation functions wouldn't work.
How can I solve it?
Help!! @ptrblck

You could try to assign the module class to act and create the module in the list before wrapping it in an nn.Sequential container:

act = nn.ReLU
layers += [conv2d, nn.BatchNorm2d(v), act(inplace=True)]
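
E.g., a hypothetical make_layers variant (not the torchvision one) taking the activation class as an argument:

import torch.nn as nn

def make_layers(cfg, activation_cls=nn.ReLU, batch_norm=False, in_channels=3):
    layers = []
    for v in cfg:
        conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
        if batch_norm:
            # assumes the activation class accepts inplace, e.g. nn.ReLU or nn.SELU
            layers += [conv2d, nn.BatchNorm2d(v), activation_cls(inplace=True)]
        else:
            layers += [conv2d, activation_cls(inplace=True)]
        in_channels = v
    return nn.Sequential(*layers)

features = make_layers([64, 64], activation_cls=nn.ReLU, batch_norm=True)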

Thanks, but that means I can't use the functional API, as you said earlier.

How can I do that?

You would just recreate the nn.ReLU() module, which wouldn’t make any difference (or maybe it fits your use case perfectly and I misunderstand it):

class MyAct(nn.Module):
    def __init__(self, act):
        super(MyAct, self).__init__()
        self.act = act

    def forward(self, x):
        x = self.act(x)
        return x

layers += [conv2d, nn.BatchNorm2d(v), MyAct(F.relu)]
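
E.g., a small usage sketch (the surrounding names are hypothetical) passing a tuple of functional activations through the wrapper:

import torch
import torch.nn as nn
import torch.nn.functional as F

activations = (F.relu, F.selu)  # hypothetical tuple of functional activations
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    MyAct(activations[0]),  # MyAct as defined above
)
out = block(torch.randn(2, 3, 8, 8))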

Thanks, it will solve the problem :slight_smile:


Is there a way to take the mean/std across only the batch dimension, and thus elementwise normalize every element in each tensor of shape [C, L]? I.e., the equivalent of this:

out_manual = (x - x.mean(dim=0)) / x.std(dim=0)

Would reshaping to [N, C*L], and then nn.BatchNorm1d(num_features=C*L) do the trick?
Or, although it’s an abuse of the concept of layer normalization, would this be better/more performant:

x = x.permute(1, 2, 0)  # [C, L, N]
nn.LayerNorm(N)

The problem in this latter case is that the model has to be initialized with the batch size (and thus this must stay constant for the entire training).

The reshaping approach with batchnorm might work, if I understand the use case correctly.
Note that since you are normalizing each element individually, the stats would be calculated from the batch dimension only. For small batch sizes your stats might be very noisy, but maybe it fits your use case.
Let us know how your experiments went :wink:
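
A minimal sketch of the reshaping approach, normalizing each (channel, position) element over the batch dimension only:

import torch
import torch.nn as nn

N, C, L = 8, 3, 4
x = torch.randn(N, C, L) * 10 + 5

bn = nn.BatchNorm1d(C * L)                # one statistic per (channel, position) element
out = bn(x.view(N, C * L)).view(N, C, L)

# manual reference: stats taken over the batch dimension only
out_manual = (x - x.mean(dim=0)) / x.std(dim=0, unbiased=False)
print(torch.allclose(out, out_manual, atol=1e-5))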


Hi, thanks for the answer.
Is there a point in applying a batchnorm layer directly to the input, perhaps even instead of a normalizing transform?

I haven’t seen a lot of implementations using this approach and, as so often, it might depend on your use case, but if the Normalize transformation already normalizes the inputs to zero mean and unit variance, the batchnorm layer wouldn’t do much more besides adding its affine parameters (if used).

If a Normalize transform doesn’t normalize the inputs, for example if the input is an online stream, or if preprocessing is expensive, would it make sense to use batch norm on the input?

It could work and you should definitely experiment with it.
Note that the running stats will be updated in each forward pass using the current batch statistics. If your input data changes the stats after a while, the running stats of the batchnorm layer would also “track” these changes and might thus perform badly on the data from the first iterations.
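
E.g., a minimal sketch of using a batchnorm layer as the first layer to normalize an un-normalized input stream (the affine=False choice is an assumption, to keep it a pure normalizer):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.BatchNorm2d(3, affine=False),  # normalizes the 3 input channels; running stats are updated each forward pass in train mode
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(2, 3, 32, 32) * 5 + 2  # un-normalized input batch
out = model(x)
print(model[0].running_mean)  # moves toward the per-channel mean of the incoming data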