Original Encoder-Decoder Transformer: Text Generation?

In order to translate a sentence with the original encoder-decoder transformer, the following happens:

  1. The source sentence is encoded,
  2. the decoder initially receives a start token as input and learns to generate a new token based on the previously seen tokens; each generated token is concatenated to the decoder input tokens for the next step.
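The two steps above can be sketched as a greedy decoding loop. This is only a minimal illustration: `encode` and `decode_step` are hypothetical placeholders standing in for the real encoder and decoder, not functions from any library.

```python
# Minimal sketch of encoder-decoder generation; `encode` and
# `decode_step` are hypothetical stand-ins for the real model.
SOS, EOS = 0, 1

def encode(src_tokens):
    # Placeholder encoder: returns the "memory" the decoder attends to.
    return list(src_tokens)

def decode_step(memory, decoder_input):
    # Placeholder decoder: simply echoes one source token per step,
    # then emits EOS. A real decoder would predict from logits.
    idx = len(decoder_input) - 1  # number of tokens generated so far
    return memory[idx] if idx < len(memory) else EOS

def greedy_translate(src_tokens, max_len=10):
    memory = encode(src_tokens)        # 1. encode the source sentence
    out = [SOS]                        # 2. decoder starts from the start token
    for _ in range(max_len):
        nxt = decode_step(memory, out) # next token from previously seen tokens
        out.append(nxt)                # concatenate it to the decoder input
        if nxt == EOS:
            break
    return out

print(greedy_translate([5, 6, 7]))  # → [0, 5, 6, 7, 1]
```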

Let’s say we want to translate the sentence “Wie geht es dir?”
This sentence would be fed into the encoder (tokenized first, of course), and the decoder gets the start token [SOS] as input, with shape (1, 1). There is then a multi-head attention block whose queries and keys come from the encoder output, in shape (1, num_heads, seq_length, embed_dim), and whose values come from the previous masked multi-head attention block, in shape (1, num_heads, 1, embed_dim) (so effectively, when we feed the token [SOS] into the decoder, we have a sequence length of 1).

In self-attention, we calculate softmax(QK^T / sqrt(d_k)), which has a resulting shape of (1, num_heads, seq_length, seq_length), but wouldn't the multiplication with V then fail with a shape mismatch?
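A quick shape check may help here. In the original paper's cross-attention, the queries come from the decoder side and the keys and values both come from the encoder output; with that arrangement the products line up even when the decoder sequence length is 1. The dimensions below are illustrative (NumPy, standing in for the real tensors):

```python
import numpy as np

batch, num_heads, src_len, tgt_len, head_dim = 1, 8, 5, 1, 64

# Cross-attention as in the original transformer:
# Q from the decoder ([SOS] only, so tgt_len = 1),
# K and V from the encoder output (src_len = 5 source tokens).
Q = np.random.randn(batch, num_heads, tgt_len, head_dim)
K = np.random.randn(batch, num_heads, src_len, head_dim)
V = np.random.randn(batch, num_heads, src_len, head_dim)

scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(head_dim)   # (1, 8, 1, 5)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = weights @ V                                         # (1, 8, 1, 64)
print(scores.shape, out.shape)
```

Note that the attention matrix is (tgt_len, src_len) rather than (seq_length, seq_length), so multiplying by V, whose leading dimension is src_len, is well-defined.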

Maybe this notebook will help; it is an extended version of an official PyTorch tutorial. Or maybe this notebook, which walks through the Transformer using a from-scratch implementation. If nothing else, you can add print statements to check all the relevant tensor shapes.
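As a sketch of that last suggestion, you can reproduce the situation in the question with `torch.nn.MultiheadAttention` and print the shapes directly (the embed dim and head count here are the original paper's defaults, chosen just for illustration):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
src_len, tgt_len = 5, 1  # 5 source tokens vs. a single [SOS] token

memory = torch.randn(1, src_len, embed_dim)  # encoder output
tgt = torch.randn(1, tgt_len, embed_dim)     # decoder input so far ([SOS])

cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
# Queries from the decoder, keys/values from the encoder output.
out, weights = cross_attn(query=tgt, key=memory, value=memory)
print(out.shape)      # torch.Size([1, 1, 512])
print(weights.shape)  # torch.Size([1, 1, 5])
```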