The official documentation says that nn.TransformerEncoderLayer is made up of a self-attention layer followed by a feedforward network. Here are some of the input parameters and an example:
- d_model – the number of expected features in the input (required).
- dim_feedforward – the dimension of the feedforward network model (default=2048).
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024)
src = torch.rand(10, 32, 512)  # random input tensor
out = encoder_layer(src)
print(out.shape)
The output shape is:
torch.Size([10, 32, 512])
There are two questions:
- Why is the output shape [10, 32, 512]? Shouldn't it be [10, 32, 1024], since the feedforward network comes after the self-attention layer? (See the sketch of my understanding below.)
- In this example, does the input shape [10, 32, 512] correspond to [batch_size, seq_length, embedding]?
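
To make the first question concrete, this is a minimal sketch of how I imagine the feedforward part of the layer (just my assumption of a standard two-Linear-layer block, not code taken from the PyTorch source):

import torch
import torch.nn as nn

# Assumed structure of the feedforward sub-layer: the second Linear maps
# the hidden size back to d_model, so the output keeps 512 features.
feedforward = nn.Sequential(
    nn.Linear(512, 1024),   # d_model -> dim_feedforward
    nn.ReLU(),
    nn.Linear(1024, 512),   # dim_feedforward -> d_model
)
x = torch.rand(10, 32, 512)
print(feedforward(x).shape)  # torch.Size([10, 32, 512])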
Regarding the second question: in my experience, when using nn.Embedding the shape is [batch_size, seq_length, embedding], but it looks like the shape is [seq_length, batch_size, embedding] in this tutorial.
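
For example, this is the kind of shape I usually get from nn.Embedding (a minimal sketch with made-up vocabulary and batch sizes):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=1000, embedding_dim=512)
tokens = torch.randint(0, 1000, (32, 10))  # [batch_size, seq_length]
print(embedding(tokens).shape)             # torch.Size([32, 10, 512]) = [batch_size, seq_length, embedding]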