How to use layer norm after conv 1d layer?

Can't figure out what to pass as the argument.

You might want to read the thread "Understanding Convolution 1D output and Input".

I know how 1d conv works, but I can't figure out what to pass to LayerNorm.
If I pass the number of features (like with batch norm), LayerNorm expects that to be the last dim.
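
For reference, a minimal sketch of that shape constraint, using the same (batch, channels, length) layout as below:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 6)     # (batch, channels, length)
ln = nn.LayerNorm(3)         # 3 = number of channels
# ln(x) raises a shape error: LayerNorm matches normalized_shape against
# the trailing dims of the input, and x's last dim is 6 (length), not 3
out = ln(x.transpose(1, 2))  # move channels to the last dim first
print(out.shape)             # torch.Size([1, 6, 3])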

I think we have 3 options

import torch
import torch.nn as nn

a = nn.Conv1d(3, 3, 3)   # in channels 3, out channels 3, kernel size 3
x = torch.randn(1, 3, 6) # batch size 1, 3 channels, sequence length 6
a(x).shape

torch.Size([1, 3, 4])
first option

b = nn.LayerNorm([4])

second

b = nn.LayerNorm([3, 4])

third

b = nn.LayerNorm([1, 3, 4])

and then

b(a(x))

in the first case, the mean and variance are computed per channel, over the length dim:

out = a(x)
out.mean([2]), out.var([2], unbiased=False)

second case

out.mean([1, 2]), out.var([1, 2], unbiased=False)

third case

out.mean([0, 1, 2]), out.var([0, 1, 2], unbiased=False)

that is

out.mean(), out.var(unbiased=False)
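
To double-check those against what nn.LayerNorm actually computes, a small sketch for the second option (1e-05 is LayerNorm's default eps):

import torch
import torch.nn as nn

a = nn.Conv1d(3, 3, 3)
x = torch.randn(1, 3, 6)
out = a(x)                   # shape (1, 3, 4)

b = nn.LayerNorm([3, 4])     # second option: stats over (channels, length)
manual = (out - out.mean([1, 2], keepdim=True)) \
    / (out.var([1, 2], unbiased=False, keepdim=True) + 1e-05).sqrt()
print(torch.allclose(b(out), manual, atol=1e-6))  # True at init (weight=1, bias=0)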

So what is the right way?
I can't know the length because it is different in every batch.
Here they just pass the feature dim.
Should [feature_dim, 1] work?

I think layer norm is generally used after nn.Embedding because we do not want to mix one word’s embedding with another word’s embedding while normalizing.
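
For example, a minimal sketch (the vocab size and embedding dim are arbitrary):

import torch
import torch.nn as nn

emb = nn.Embedding(100, 8)      # vocab of 100, embedding size 8
ln = nn.LayerNorm(8)            # stats per token, over its own 8 features
tokens = torch.randint(0, 100, (1, 5))
print(ln(emb(tokens)).shape)    # torch.Size([1, 5, 8])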

I think you could go with another normalization technique like batchnorm; if you want to use layernorm after applying conv1d, then you will have to pass the size of the last dim, which would be Lout.

I want to compare batchnorm with other types of norm layers.
Seems like I need to use groupnorm.
(not sure if it gives the same performance)

I think doing

x = torch.randn(1, 3, 6) # batch size 1, 3 channels, 6 length of sequence
a = nn.Conv1d(3, 6, 3)   # in channels 3, out channels 6, kernel size 3
gn = nn.GroupNorm(1, 6)  # num_groups 1, num_channels 6
gn(a(x))

tensor([[[-0.1459,  0.5860,  0.1771,  1.1413],
         [-0.8613,  2.7552, -1.0135,  0.8898],
         [-0.1119, -0.1656, -0.4536, -0.9865],
         [ 0.6755, -1.3193,  1.2248, -0.5849],
         [ 1.2789, -0.5229,  0.1345,  0.1763],
         [-2.1555,  0.0149, -0.2769, -0.4565]]],
       grad_fn=<NativeGroupNormBackward>)

is equivalent to

ln = nn.LayerNorm([6, 4])  # normalize over (channels, length)
ln(a(x))

tensor([[[-0.1459,  0.5860,  0.1771,  1.1413],
         [-0.8613,  2.7552, -1.0135,  0.8898],
         [-0.1119, -0.1656, -0.4536, -0.9865],
         [ 0.6755, -1.3193,  1.2248, -0.5849],
         [ 1.2789, -0.5229,  0.1345,  0.1763],
         [-2.1555,  0.0149, -0.2769, -0.4565]]],
       grad_fn=<NativeLayerNormBackward>)

so we could do

nn.GroupNorm(1, out_channels)

and we would not have to specify Lout after applying Conv1d; it acts like the second case of LayerNorm above.
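
A quick sketch of why that helps when the length varies (the shapes here are arbitrary):

import torch
import torch.nn as nn

conv = nn.Conv1d(3, 6, 3)
gn = nn.GroupNorm(1, 6)         # one group spanning all 6 channels

# the same module handles any sequence length; nn.LayerNorm([6, Lout]) would not
for L in (6, 10, 25):
    x = torch.randn(2, 3, L)
    print(gn(conv(x)).shape)    # Lout = L - 2 changes, no reconfiguration needed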

So, to compare batchnorm with groupnorm or 2nd case of layernorm, we would have to replace

nn.BatchNorm1d(out_channels)

with

nn.GroupNorm(1, out_channels)
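
For instance, a hypothetical helper for that comparison (make_norm_block and its signature are made up for illustration):

import torch.nn as nn

def make_norm_block(norm, in_ch, out_ch, k=3):
    # hypothetical helper: swap only the normalization layer, keep the rest fixed
    if norm == "batch":
        norm_layer = nn.BatchNorm1d(out_ch)
    else:
        norm_layer = nn.GroupNorm(1, out_ch)  # layernorm-like, length-agnostic
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, k), norm_layer, nn.ReLU())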

Hello, @vainaijr, I am running into the same issue. Could you possibly just transpose the tensor from (batch, features, seq_len) -> (batch, seq_len, features), then feed that through layernorm, and then transpose back? Would this work?

yes I think it would work, for example,

x = torch.randn(1, 3, 4) # 3 words, each represented by an embedding of size 4
# if we had (batch, features, seq_len), we would transpose it to (batch, seq_len, features) first
x

tensor([[[-0.0247, -0.0365,  0.0992,  0.8617],
         [ 0.9550,  1.0243, -0.7017, -1.1300],
         [ 1.2312,  1.0802,  0.3377,  1.8007]]])

y = nn.LayerNorm([4])
# we do not want to normalize one word based on another word,
# so we normalize over each word's individual representation
y(x)

tensor([[[-0.6720, -0.7037, -0.3385,  1.7142],
         [ 0.9514,  1.0232, -0.7654, -1.2092],
         [ 0.2275, -0.0618, -1.4847,  1.3190]]],
       grad_fn=<NativeLayerNormBackward>)

which is equivalent to result of,

((x.permute(2, 0, 1) - x.mean([2]))/(x.var([2], unbiased=False) + 1e-05).sqrt()).permute(1, 2, 0)
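
To close the loop on the transpose question above, a minimal sketch of the full round trip (the shapes are arbitrary):

import torch
import torch.nn as nn

conv_out = torch.randn(2, 8, 10)  # (batch, features, seq_len), e.g. conv1d output
ln = nn.LayerNorm(8)              # normalize over the feature dim only

# transpose so features are last, normalize, transpose back
normed = ln(conv_out.transpose(1, 2)).transpose(1, 2)
print(normed.shape)               # torch.Size([2, 8, 10])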