In order to translate a sentence with the original encoder-decoder transformer, the following happens:
- The source sentence is encoded once.
- The decoder initially receives a start token as input and generates a new token based on the previously seen tokens; the generated token is then concatenated to the decoder input for the next step.
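This autoregressive loop can be sketched in a few lines. This is a minimal, hedged sketch: `encode` and `decode_step` are hypothetical stand-ins for the real encoder and decoder forward passes, not calls from any specific library.

```python
# Hedged sketch of greedy autoregressive decoding.
# encode/decode_step are hypothetical stand-ins, not a real model.

SOS, EOS = 0, 1  # assumed special token ids

def encode(src_tokens):
    # Stand-in for the transformer encoder: produces the "memory"
    # the decoder cross-attends to. Here we simply echo the tokens.
    return list(src_tokens)

def decode_step(memory, tgt_tokens):
    # Stand-in for one decoder forward pass: returns the next token id.
    # Toy behavior: copy the source token at the current position,
    # then emit EOS when the source is exhausted.
    pos = len(tgt_tokens) - 1  # tokens generated so far (excluding SOS)
    return memory[pos] if pos < len(memory) else EOS

def greedy_translate(src_tokens, max_len=20):
    memory = encode(src_tokens)         # encoder runs exactly once
    tgt = [SOS]                         # decoder starts with the start token
    for _ in range(max_len):
        nxt = decode_step(memory, tgt)  # predict the next token
        tgt.append(nxt)                 # concatenate it to the decoder input
        if nxt == EOS:
            break
    return tgt

print(greedy_translate([5, 6, 7, 8]))  # [0, 5, 6, 7, 8, 1]
```

The point of the sketch is the control flow: the encoder output is computed once, while the decoder is called repeatedly on a growing target sequence.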
Let’s say we want to translate the sentence “Wie geht es dir?”
This sentence would be fed into the encoder (tokenized first, of course), and the decoder gets as input the start token [SOS] in shape (1, 1).
There's now a multi-head attention block (cross-attention) whose keys and values come from the encoder output, in shape (1, num_heads, src_seq_length, embed_dim), and whose queries come from the previous masked multi-head attention block, in shape (1, num_heads, 1, embed_dim) (so effectively, when we feed the token [SOS] into the decoder, we have a decoder sequence length of 1).
In this block we calculate softmax(QK^T / sqrt(d_k)) V. The attention scores QK^T have shape (1, num_heads, 1, src_seq_length), so multiplying with V yields (1, num_heads, 1, embed_dim): one output vector per decoder position. The shapes match even though the encoder and decoder sequence lengths differ.
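The shape bookkeeping can be verified directly with NumPy. This is a hedged sketch: the sizes (src_seq_length=5 for the tokenized source, num_heads=8, head_dim=64 per head) are assumptions for illustration, and `head_dim` here plays the role of the per-head embedding dimension.

```python
# Hedged shape check for cross-attention with a single decoder token.
# Assumed sizes: src_len=5, num_heads=8, head_dim=64 (illustrative only).
import numpy as np

batch, num_heads, src_len, head_dim = 1, 8, 5, 64
tgt_len = 1  # decoder has only seen [SOS] so far

Q = np.random.randn(batch, num_heads, tgt_len, head_dim)  # from the decoder
K = np.random.randn(batch, num_heads, src_len, head_dim)  # from the encoder
V = np.random.randn(batch, num_heads, src_len, head_dim)  # from the encoder

# Scores: (1, 8, 1, 64) @ (1, 8, 64, 5) -> (1, 8, 1, 5)
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(head_dim)

# Softmax over the source positions (last axis)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Output: (1, 8, 1, 5) @ (1, 8, 5, 64) -> (1, 8, 1, 64)
out = weights @ V

print(scores.shape)  # (1, 8, 1, 5)
print(out.shape)     # (1, 8, 1, 64)
```

Note that the softmax axis is the source sequence length, so each decoder position ends up with a weighted mixture of the encoder's value vectors, regardless of how many tokens the decoder has produced so far.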