nn.TransformerEncoderLayer input/output shape

On the official website, it is mentioned that nn.TransformerEncoderLayer is made up of a self-attention layer followed by a feedforward network. Here are some of the input parameters and an example:

  • d_model – the number of expected features in the input (required).
  • dim_feedforward – the dimension of the feedforward network model (default=2048).
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024)
src = torch.rand(10, 32, 512) 
out = encoder_layer(src)
print(out.shape)

The output shape is:

torch.Size([10, 32, 512])

There are two questions:

  1. Why is the output shape [10, 32, 512]? Shouldn't it be [10, 32, 1024], since the feedforward network comes after the self-attention layer?
  2. In this example, does the input shape [10, 32, 512] correspond to [batchsize, seq_length, embedding]?
    In my experience, with nn.Embedding the shape is [batchsize, seq_length, embedding],
    but it looks like the shape is [seq_length, batchsize, embedding] in this tutorial.

Hi,

  1. The TransformerEncoder “transforms” each input embedding with the help of the neighboring embeddings in the sequence, so it is normal that the output is homogeneous with the input: it has the same shape as the input. dim_feedforward is only the hidden size of the feed-forward sub-layer; a second linear projection maps it back to d_model, which is why the last dimension stays 512.
    You can look at the implementation of nn.TransformerEncoderLayer for more details: you can see where dim_feedforward is used (there is a short sketch of this after the list).
  2. See the full nn.Transformer documentation for details on the required shapes: it is effectively [seq, batch, emb].
    Note that nn.Embedding works with any input shape: [size1, size2] -> [size1, size2, emb],
    so its output can be [seq, batch, emb] or [batch, seq, emb] depending on how you lay out the indices (a quick check is sketched below as well).
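
Here is a minimal sketch of why the last dimension stays at d_model: the feed-forward sub-layer expands to dim_feedforward and then projects back. It mirrors the linear1/activation/linear2 structure inside nn.TransformerEncoderLayer (without dropout, residuals, or layer norm); FeedForwardSketch is just an illustrative name, not part of PyTorch.

import torch
import torch.nn as nn

# Sketch of the feed-forward sub-layer: expand to dim_feedforward,
# apply the activation, then project back down to d_model.
class FeedForwardSketch(nn.Module):
    def __init__(self, d_model=512, dim_feedforward=1024):
        super().__init__()
        self.linear1 = nn.Linear(d_model, dim_feedforward)  # 512 -> 1024
        self.linear2 = nn.Linear(dim_feedforward, d_model)  # 1024 -> 512

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

ff = FeedForwardSketch()
x = torch.rand(10, 32, 512)
print(ff(x).shape)  # torch.Size([10, 32, 512]) -- same shape as the input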
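
And a quick check of the nn.Embedding point: the embedding layer simply appends an embedding dimension to whatever index shape it receives, so both [seq, batch] and [batch, seq] index tensors work (the vocabulary size of 1000 here is an arbitrary choice for illustration).

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=512)

idx_seq_first = torch.randint(0, 1000, (10, 32))    # [seq, batch] indices
idx_batch_first = torch.randint(0, 1000, (32, 10))  # [batch, seq] indices

print(emb(idx_seq_first).shape)    # torch.Size([10, 32, 512])
print(emb(idx_batch_first).shape)  # torch.Size([32, 10, 512])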

Thank you very much! This really helps me a lot.