nn.TransformerEncoderLayer input/output shape

On the official website, it is mentioned that nn.TransformerEncoderLayer is made up of a self-attention layer followed by a feedforward network. Here are some of the input parameters and an example:

  • d_model – the number of expected features in the input (required).
  • dim_feedforward – the dimension of the feedforward network model (default=2048).
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024)
src = torch.rand(10, 32, 512) 
out = encoder_layer(src)
print(out.shape)

The output shape is:

torch.Size([10, 32, 512])

There are two questions:

  1. Why is the output shape [10, 32, 512]? Shouldn't it be [10, 32, 1024], since the feedforward network comes after the self-attention layer?
  2. In this example, does the input shape [10, 32, 512] correspond to [batchsize, seq_length, embedding]?
    In my experience, with nn.Embedding the shape is [batchsize, seq_length, embedding],
    but it looks like the shape is [seq_length, batchsize, embedding] in this tutorial.

Hi,

  1. The TransformerEncoder “transforms” each input embedding with the help of the neighboring embeddings in the sequence, so it is normal that the output is homogeneous with the input: it has the same shape as the input. dim_feedforward is only the hidden size of the feed-forward sub-layer; a second linear projection maps it back to d_model, which is why the last dimension stays 512.
    You can look at the implementation of nn.TransformerEncoderLayer for more details: you can see where dim_feedforward is used (there is a short sketch of this after the list).
  2. See the full nn.Transformer documentation for details on the required shapes: it is effectively [seq, batch, emb].
    Note that nn.Embedding works with any input shape: [size1, size2] -> [size1, size2, emb],
    so its output can be [seq, batch, emb] or [batch, seq, emb] depending on how you lay out the indices (a quick check is sketched below as well).
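
Here is a minimal sketch of why the last dimension stays at d_model: the feed-forward sub-layer expands to dim_feedforward and then projects back. It mirrors the linear1/activation/linear2 structure inside nn.TransformerEncoderLayer (without dropout, residuals, or layer norm); FeedForwardSketch is just an illustrative name, not part of PyTorch.

import torch
import torch.nn as nn

# Sketch of the feed-forward sub-layer: expand to dim_feedforward,
# apply the activation, then project back down to d_model.
class FeedForwardSketch(nn.Module):
    def __init__(self, d_model=512, dim_feedforward=1024):
        super().__init__()
        self.linear1 = nn.Linear(d_model, dim_feedforward)  # 512 -> 1024
        self.linear2 = nn.Linear(dim_feedforward, d_model)  # 1024 -> 512

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

ff = FeedForwardSketch()
x = torch.rand(10, 32, 512)
print(ff(x).shape)  # torch.Size([10, 32, 512]) -- same shape as the input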
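
And a quick check of the nn.Embedding point: the embedding layer simply appends an embedding dimension to whatever index shape it receives, so both [seq, batch] and [batch, seq] index tensors work (the vocabulary size of 1000 here is an arbitrary choice for illustration).

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=512)

idx_seq_first = torch.randint(0, 1000, (10, 32))    # [seq, batch] indices
idx_batch_first = torch.randint(0, 1000, (32, 10))  # [batch, seq] indices

print(emb(idx_seq_first).shape)    # torch.Size([10, 32, 512])
print(emb(idx_batch_first).shape)  # torch.Size([32, 10, 512])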

Thank you very much! This really helps me a lot.