Converting from nn.BatchNorm2d to nn.LayerNorm in a CNN

For the improved Wasserstein GAN (i.e. Wasserstein GAN with gradient penalty, WGAN-GP), layer normalization is recommended in the discriminator instead of nn.BatchNorm2d.

I see that nn.LayerNorm was (relatively) recently added to torch.nn, and I'd like to use it rather than writing my own layer normalization. However, simply replacing calls to nn.BatchNorm2d(input_size) with nn.LayerNorm(input_size) gives the following error:

File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 1314, in layer_norm
PSTtorch.backends.cudnn.enabled)
PSTRuntimeError: Given normalized_shape=[128], expected input with shape [*, 128], 
but got input of size[128, 128, 16, 16]
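(The requirement is just that the input's trailing dimensions match normalized_shape; a minimal sketch, with shapes mirroring the error above:)

import torch
import torch.nn as nn

ln = nn.LayerNorm(128)                 # normalized_shape = [128]
out = ln(torch.randn(32, 128))         # fine: the last dimension is 128
x = torch.randn(128, 128, 16, 16)      # a conv feature map, N x C x H x W
# ln(x)  # RuntimeError: trailing dims (..., 16) don't match [128]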

This is the structure of the network:

import torch
import torch.nn as nn

class WgangpDiscriminator(nn.Module):
    def __init__(self, channels, num_disc_filters):
        super(WgangpDiscriminator, self).__init__()
        self.ngpu = 1
        self.main = nn.Sequential(
            # num_disc_filters = 64
            nn.Conv2d(channels, num_disc_filters, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters, num_disc_filters * 2, 4, 2, 1, bias=False),
            nn.LayerNorm(num_disc_filters * 2),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters * 2, num_disc_filters * 4, 4, 2, 1, bias=False),
            nn.LayerNorm(num_disc_filters * 4),
            nn.LeakyReLU(0.2, inplace=True),

            # state size: (num_disc_filters * 4) x 8 x 8
            nn.Conv2d(num_disc_filters * 4, num_disc_filters * 8, 4, 2, 1, bias=False),
            nn.LayerNorm(num_disc_filters * 8),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters * 8, 1, 4, 1, 0, bias=False),
            # thanks to https://github.com/pytorch/examples/issues/70 (apaszke)
        )

    def forward(self, inp):
        if isinstance(inp.data, torch.cuda.FloatTensor) and self.ngpu > 1:
            output = nn.parallel.data_parallel(self.main, inp, range(self.ngpu))
        else:
            output = self.main(inp)
        return output.view(-1, 1).squeeze(1)

nn.LayerNorm expects normalized_shape as input (an int, list, or torch.Size), but nn.Conv2d layers don't have .size, .get_shape(), or .shape(), so I can't follow the example in the docs:

input = torch.randn(20, 5, 10, 10) 
# With Learnable Parameters 
m = nn.LayerNorm(input.size()[1:])
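The best workaround I can see is to probe the shapes with a dummy forward pass (a sketch, with the channel count and 64x64 input size assumed from the error above), but that feels clunky:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, 4, 2, 1, bias=False)   # first conv, assumed 3 input channels
dummy = torch.randn(1, 3, 64, 64)              # assumed 64x64 input
print(conv(dummy).shape[1:])                   # torch.Size([64, 32, 32])
# ...which could then be passed to nn.LayerNorm(conv(dummy).shape[1:])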

How do I do this conversion?

Many thanks in advance!


Hi, I've run into the same problem. Have you solved it? Thanks!

I've also hit this problem. Have you solved it? Thanks.

Hi asberman,

As I understand it, LayerNorm computes the mean and variance per sample over the normalized dimensions (not across the batch), so you should pass the full feature-map shape, including the spatial dimensions, not just the channel dimension as in the case of BatchNorm.
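(A quick sanity check of this, as a sketch with arbitrary sizes:)

import torch
import torch.nn as nn

x = torch.randn(8, 64, 16, 16)  # N x C x H x W
ln = nn.LayerNorm([64, 16, 16], elementwise_affine=False)
# manual per-sample normalization over (C, H, W) matches LayerNorm
mean = x.mean(dim=(1, 2, 3), keepdim=True)
var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
print(torch.allclose(ln(x), (x - mean) / torch.sqrt(var + 1e-5), atol=1e-5))  # True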
I am actually working on the same thing, and you can try changing the following.

The first layer norm:
nn.LayerNorm(num_disc_filters * 2), --> nn.LayerNorm([num_disc_filters * 2, 16, 16]),

The second:
nn.LayerNorm(num_disc_filters * 4), --> nn.LayerNorm([num_disc_filters * 4, 8, 8]),

The third:
nn.LayerNorm(num_disc_filters * 8), --> nn.LayerNorm([num_disc_filters * 8, 4, 4]),

I think this will work without error.
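(For reference, these spatial sizes assume 64x64 inputs, which matches the [128, 128, 16, 16] in your error: each conv here with kernel 4, stride 2, padding 1 halves the resolution. A quick sketch of the arithmetic:)

def conv_out(h, k=4, s=2, p=1):
    # Conv2d output size: floor((h + 2*p - k) / s) + 1
    return (h + 2 * p - k) // s + 1

sizes = [64]                # assumed input resolution
for _ in range(4):          # the four stride-2 convs
    sizes.append(conv_out(sizes[-1]))
print(sizes)                # [64, 32, 16, 8, 4]; the three norms see 16, 8, and 4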
However, I found that even after replacing BatchNorm with LayerNorm, the model still diverges.
If you have solved this properly, please let me know.

Thanks


Your answer hard-codes the spatial sizes. Is there a unified way to do this that doesn't depend on the input resolution?
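One option I can think of (an untested sketch, with a hypothetical wrapper name): compute the normalized shape at runtime via F.layer_norm, giving up the learnable affine parameters since their size would otherwise depend on the input:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RuntimeLayerNorm(nn.Module):
    # Hypothetical wrapper: layer-normalizes each sample over all
    # non-batch dimensions, so no spatial sizes need to be hard-coded.
    # Note: no learnable weight/bias, unlike nn.LayerNorm's default.
    def forward(self, x):
        return F.layer_norm(x, x.shape[1:])

# Usage sketch: drop it into the Sequential in place of nn.LayerNorm(...)
x = torch.randn(8, 128, 16, 16)
print(RuntimeLayerNorm()(x).shape)  # torch.Size([8, 128, 16, 16])

Alternatively, nn.GroupNorm(1, num_channels) also normalizes each sample over (C, H, W) regardless of resolution, and keeps learnable per-channel affine parameters.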
