Batch Normalization disambiguation

dcasal · April 12, 2019, 10:31am

Hi everybody,
I’m really confused about Batch Normalization’s behaviour in PyTorch.
In theory, BN should compute the mean and variance over all samples in the batch, separately for each channel.

OK, so if my input is a matrix (like an image), I have 3 options:

BatchNorm2d → as my data is 4 dimensional (N, C, H, W)

BatchNorm1d → by flattening data (N, C, H, W) → (N, C, L)

BatchNorm1d → if my data has only one channel (am I right?) I can simply change (N, C, L) → (N, L)

I would then expect every BN variant to give me the same output. But that’s not the case: every output is different.
I’m really confused about this, especially about the difference between BatchNorm1d with input data of shape (N, C, L) and of shape (N, L). Am I right in saying that the shape (N, L) is for single-channel data?
Thanks for help!

Here is my code (it’s really simple, have a look):

import torch
import torch.nn as nn


class BN2D(nn.Module):
    def __init__(self):
        super(BN2D, self).__init__()
        # expects input of shape (10, 1, 2, 2), i.e. (N, C, H, W)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        x = self.bn(x)
        # flatten the normalized output to (N, H*W)
        x = x.view(x.size(0), -1)

        return x


class BN1D(nn.Module):
    def __init__(self):
        super(BN1D, self).__init__()
        # expects input of shape (10, 1, 4), i.e. (N, C, L)
        self.bn = nn.BatchNorm1d(1)
        # expects input of shape (10, 4), i.e. (N, L)
        self.bn_1 = nn.BatchNorm1d(4)

    def forward(self, x):
        # reshape to (N, 1, L), normalize, then flatten to (N, L)
        y = x.view(x.size(0), 1, -1)
        y = self.bn(y)
        y = y.view(y.size(0), -1)

        # reshape to (N, L) and normalize
        y_1 = x.view(x.size(0), -1)
        y_1 = self.bn_1(y_1)

        return y, y_1


def main():
    bn1d = BN1D()
    print(bn1d)
    bn2d = BN2D()
    print(bn2d)

    x = torch.randn(10, 1, 2, 2)

    out1d, out1d_1 = bn1d(x)
    out2d = bn2d(x)

    print(out1d)
    print(out1d_1)
    print(out2d)


if __name__ == '__main__':
    main()

ptrblck · April 12, 2019, 10:53pm

No, in this case you would use L channels.
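
To illustrate the difference, here is a minimal sketch. The sizes and the seed are arbitrary, and affine=False is my own choice here, only to keep the learnable parameters out of the comparison. With an (N, L) input, nn.BatchNorm1d(L) treats each of the L positions as its own channel and computes statistics over the batch only. With an (N, 1, L) input, nn.BatchNorm1d(1) pools the statistics over both N and L.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L = 10, 4
x = torch.randn(N, L)

# (N, L): L is the channel/feature dimension, so the mean and variance
# are computed per column, over the batch dimension only.
bn_features = nn.BatchNorm1d(L, affine=False)
out_features = bn_features(x)
manual_features = (x - x.mean(dim=0)) / torch.sqrt(
    x.var(dim=0, unbiased=False) + bn_features.eps
)

# (N, 1, L): a single channel, so one mean and one variance are
# computed over all N * L values.
bn_single = nn.BatchNorm1d(1, affine=False)
out_single = bn_single(x.view(N, 1, L)).view(N, L)
manual_single = (x - x.mean()) / torch.sqrt(
    x.var(unbiased=False) + bn_single.eps
)

# Each layer matches its manual computation ...
print(torch.allclose(out_features, manual_features, atol=1e-5))
print(torch.allclose(out_single, manual_single, atol=1e-5))
# ... but the two layouts do not produce the same values.
print(torch.allclose(out_features, out_single))
```

Both layers are in training mode by default, so they normalize with the statistics of the current batch rather than the running estimates.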

However, the other two approaches (nn.BatchNorm2d on (N, C, H, W) vs. nn.BatchNorm1d on (N, C, L)) should yield the same result. Since you are using the affine batchnorm transformation, you have to make sure the weight parameter is set to equal values in both layers (bias should be all zeros in both cases anyway).

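import torch
import torch.nn as nn

N, C, H, W = 10, 3, 24, 24
x = torch.randn(N, C, H, W)

bn2d = nn.BatchNorm2d(3)
bn1d = nn.BatchNorm1d(3)

# share the affine parameters so both layers apply the same scale and shift
with torch.no_grad():
    bn2d.weight = bn1d.weight
    bn2d.bias = bn1d.bias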


output2d = bn2d(x)
output1d = bn1d(x.view(N, C, -1))
print((output2d.view(N, C, -1) == output1d).all())
> tensor(1, dtype=torch.uint8)

Alternatively, you could set affine=False and skip the parameter assignment entirely.
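
A minimal sketch of that variant, reusing the shapes from the snippet above. I compare with torch.allclose instead of exact equality here as a precaution against small floating-point differences; that tolerance is my own choice.

```python
import torch
import torch.nn as nn

N, C, H, W = 10, 3, 24, 24
x = torch.randn(N, C, H, W)

# affine=False: no learnable weight/bias, so there is nothing to synchronize
bn2d = nn.BatchNorm2d(C, affine=False)
bn1d = nn.BatchNorm1d(C, affine=False)

output2d = bn2d(x)
output1d = bn1d(x.view(N, C, -1))

# both layers normalize each channel over the batch and all spatial positions
print(torch.allclose(output2d.view(N, C, -1), output1d, atol=1e-5))
```

With affine=False, the weight and bias attributes of both layers are None, so only the normalization itself is applied.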

For completeness: this PR should change the initialization of the affine parameters, such that weight will be initialized with ones.

dcasal · April 18, 2019, 2:20pm

Thank you very much!!