I was following this tutorial: http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
I see that the attention weights are computed by just passing the decoder's input embedding and hidden state through a linear layer, and not by multiplying (i.e. aligning) the decoder state with the encoder outputs. Why is that?
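To make the question concrete, here is a minimal sketch of the two approaches as I understand them (shapes and sizes are assumptions for illustration, not the tutorial's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, max_length = 256, 10  # assumed sizes for illustration

# Tutorial-style: attention weights come from a linear layer over the
# decoder's own embedded input and hidden state; the encoder outputs
# play no role in computing the weights themselves.
attn = nn.Linear(hidden_size * 2, max_length)
embedded = torch.randn(1, hidden_size)   # decoder input embedding
hidden = torch.randn(1, hidden_size)     # decoder hidden state
attn_weights = F.softmax(attn(torch.cat((embedded, hidden), dim=1)), dim=1)

# What I expected: multiplicative (Luong-style) alignment, where the
# decoder state is scored against each encoder output directly.
encoder_outputs = torch.randn(max_length, hidden_size)
scores = encoder_outputs @ hidden.squeeze(0)   # (max_length,)
expected_weights = F.softmax(scores, dim=0)
```

In the first version the weights depend only on the decoder, while in the second they depend on both the decoder state and the encoder outputs, which is what I thought "attention" meant.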