nn.TransformerEncoderLayer input/output shape

The official documentation says that nn.TransformerEncoderLayer is made up of a self-attention layer followed by a feed-forward network. Here are some of the input parameters and an example:

  • d_model – the number of expected features in the input (required).
  • dim_feedforward – the dimension of the feedforward network model (default=2048).
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024)
src = torch.rand(10, 32, 512)
out = encoder_layer(src)

The output shape is:

torch.Size([10, 32, 512])

I have two questions:

  1. Why is the output shape [10, 32, 512]? Shouldn’t it be [10, 32, 1024], since the feedforward network comes after the self-attention layer?
  2. In this example, does the input shape [10, 32, 512] correspond to [batch_size, seq_length, embedding]?
    In my experience with nn.Embedding, the shape is [batch_size, seq_length, embedding],
    but it looks like the shape is [seq_length, batch_size, embedding] in this tutorial.


  1. The TransformerEncoder “transforms” each input embedding with the help of the neighboring embeddings in the sequence, so it is normal that the output is homogeneous with the input: it has the same shape as the input.
    Inside the feed-forward sub-layer, one linear layer expands from d_model to dim_feedforward and a second linear layer projects back down to d_model, which is why dim_feedforward never appears in the output shape.
    You can look at the implementation of nn.TransformerEncoderLayer for more details: you can see where dim_feedforward is used (there is a small sketch after this list).
  2. See the full nn.Transformer documentation for details on the required shapes: it is indeed [seq, batch, emb].
    Note that nn.Embedding works with indices of any shape: [size1, size2] -> [size1, size2, emb].
    So you can produce either [seq, batch, emb] or [batch, seq, emb] with nn.Embedding; you just need to hand the encoder the [seq, batch, emb] layout (see the example after this list).
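
Here is a minimal sketch of the feed-forward sub-layer, roughly what nn.TransformerEncoderLayer does internally (dropout, residual connections and layer norms are omitted; ReLU is the layer’s default activation). It shows why dim_feedforward never changes the output shape:

import torch
import torch.nn as nn

d_model, dim_feedforward = 512, 1024

# The feed-forward block expands to dim_feedforward, then projects back to d_model
ff = nn.Sequential(
    nn.Linear(d_model, dim_feedforward),  # 512 -> 1024
    nn.ReLU(),
    nn.Linear(dim_feedforward, d_model),  # 1024 -> 512
)

x = torch.rand(10, 32, d_model)           # [seq, batch, emb]
print(ff(x).shape)                        # torch.Size([10, 32, 512])

And for the second point, a small example of going from nn.Embedding output to the [seq, batch, emb] layout the encoder expects (the vocabulary size of 1000 is just an arbitrary value for illustration):

emb = nn.Embedding(num_embeddings=1000, embedding_dim=512)
tokens = torch.randint(0, 1000, (32, 10))  # [batch, seq] token indices
out = emb(tokens)                          # [batch, seq, emb] = [32, 10, 512]
src = out.transpose(0, 1)                  # [seq, batch, emb] = [10, 32, 512]

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024)
print(encoder_layer(src).shape)            # torch.Size([10, 32, 512])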

Thank you very much!! This really helps me a lot