This image is from the tutorial at https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-intermediate-seq2seq-translation-tutorial-py.
When deriving the attention weights, the input to “attn” is the concatenation of the previous decoder hidden state and the current decoder input; the output is the attention weights, which are then applied to the encoder outputs. The shape of this output is (batch_size, FIXED_length_of_encoder_input, 1).
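To make the question concrete, here is a minimal sketch of the tutorial's scheme (the sizes `hidden_size=8` and `max_length=10` are made up for illustration; the tutorial uses 256 and `MAX_LENGTH`). Note that the scores come only from the decoder input and hidden state, and their length is the fixed `max_length`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size = 8   # illustrative size, not the tutorial's default
max_length = 10   # MAX_LENGTH: fixed cap on encoder input length

# The tutorial's attn layer maps [embedded; hidden] -> one score per encoder slot
attn = nn.Linear(hidden_size * 2, max_length)

embedded = torch.randn(1, 1, hidden_size)   # current decoder input (embedded)
hidden = torch.randn(1, 1, hidden_size)     # previous decoder hidden state
encoder_outputs = torch.zeros(max_length, hidden_size)  # padded to max_length

# Scores depend only on (embedded, hidden), NOT on encoder_outputs
attn_weights = F.softmax(attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

print(attn_weights.shape)  # torch.Size([1, 10]) -- always the fixed max_length
print(attn_applied.shape)  # torch.Size([1, 1, 8])
```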
BUT according to “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al.), when deriving the attention weights we should combine the previous decoder hidden state with each of the encoder outputs to get the corresponding energy e_ij. The output size is then (batch_size, ACTUAL_length_of_encoder_input, 1).
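For contrast, here is a sketch of the additive scoring I understand the paper to describe, e_ij = v^T tanh(W s_{i-1} + U h_j), where the number of energies follows the actual source length (again, all sizes here are made up):

```python
import torch
import torch.nn as nn

hidden_size = 8
src_len = 5  # ACTUAL length of this particular encoder input

# e_ij = v^T tanh(W s_{i-1} + U h_j), one energy per encoder output h_j
W = nn.Linear(hidden_size, hidden_size, bias=False)  # applied to decoder state
U = nn.Linear(hidden_size, hidden_size, bias=False)  # applied to encoder outputs
v = nn.Linear(hidden_size, 1, bias=False)

s_prev = torch.randn(1, hidden_size)                 # previous decoder hidden state
encoder_outputs = torch.randn(src_len, hidden_size)  # h_1 ... h_src_len

energies = v(torch.tanh(W(s_prev) + U(encoder_outputs)))  # (src_len, 1) via broadcasting
alpha = torch.softmax(energies, dim=0)                     # one weight per source position
context = (alpha * encoder_outputs).sum(dim=0)             # weighted sum of encoder outputs

print(energies.shape)  # torch.Size([5, 1]) -- follows the ACTUAL source length
```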
Could anyone justify this official tutorial implementation (i.e., using the decoder input when deriving the attention weights)? A reference paper would also help.
There is another question:
In every forward function of the tutorial, the embedding is reshaped:
embedded = self.embedding(input).view(1, 1, -1)
Why is that?
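What I can verify is the shape it produces: the tutorial decodes one token at a time with batch size 1, and `nn.GRU` expects input of shape (seq_len, batch, input_size). A minimal check (vocab and hidden sizes are made up):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(100, 8)  # illustrative vocab_size=100, hidden_size=8
input = torch.tensor([3])         # a single token index, as in the tutorial's loop

embedded = embedding(input).view(1, 1, -1)
print(embedded.shape)  # torch.Size([1, 1, 8]) -- (seq_len=1, batch=1, hidden_size)
```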