Hi. I'm struggling a bit to implement attention mechanisms, and a few questions came up along the way.

Some implementations, including the PyTorch tutorial, use the last hidden state of the encoder as the initial hidden state of the decoder in Bahdanau attention. I read the original paper, but I couldn't find this mentioned there. What is the main reason for initializing the decoder this way?

When an LSTM is used for both the encoder and the decoder, which should I use to calculate the attention score: the hidden state or the cell state?

In Luong attention, the encoder output ($H$) has shape [Seq_len, Batch, num_dir(=2) * enc_hidden_dim], while the decoder output has shape [Seq_len, Batch, num_dir(=1) * dec_hidden_dim]. With the dot alignment function, the last dimensions don't match when the encoder is bidirectional. How should I handle this?

The answer to the first question is in Appendix A.2.2 of the paper.
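As I read that appendix, the decoder's initial state is computed as $s_0 = \tanh(W_s \overleftarrow{h}_1)$, i.e. from the backward encoder's state at the first source position rather than being copied over directly. Here is a rough PyTorch sketch of that idea, assuming a single-layer bidirectional GRU encoder. The sizes and the name `init_proj` are made up for illustration and don't come from the paper or the tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, emb_dim, enc_hidden, dec_hidden = 5, 3, 8, 16, 16  # illustrative sizes

encoder = nn.GRU(emb_dim, enc_hidden, bidirectional=True)
init_proj = nn.Linear(enc_hidden, dec_hidden, bias=False)  # hypothetical name; plays the role of W_s

src = torch.randn(seq_len, batch, emb_dim)
# outputs: [seq_len, batch, 2 * enc_hidden], h_n: [2, batch, enc_hidden]
outputs, h_n = encoder(src)

# h_n[1] is the final state of the backward direction. It has read the
# sequence right-to-left, so it corresponds to the first source position.
backward_h1 = h_n[1]

# Initial decoder state: s_0 = tanh(W_s * backward_h1), shape [batch, dec_hidden]
s0 = torch.tanh(init_proj(backward_h1))
```

Seen this way, feeding the encoder's final state to the decoder is just a simpler variant of the same thing: it gives the decoder a summary of the source sentence before it emits the first token.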

For the second question: use the hidden state.
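To make the hidden-state answer concrete: `nn.LSTM` returns `output, (h_n, c_n)`, and `output` is already the stack of hidden states for every time step, so that is what the scores get computed against. Below is a minimal sketch with dot-product scoring. The sizes and variable names are only illustrative, and both LSTMs are unidirectional so the dimensions line up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, emb_dim, hidden = 5, 3, 8, 16  # illustrative sizes

encoder = nn.LSTM(emb_dim, hidden)
decoder = nn.LSTM(emb_dim, hidden)

src = torch.randn(seq_len, batch, emb_dim)
enc_out, (h_n, c_n) = encoder(src)  # enc_out holds h_t for every step t

# One decoder step, initialized with the encoder's final (h, c) pair
tgt_step = torch.randn(1, batch, emb_dim)
dec_out, (dec_h, dec_c) = decoder(tgt_step, (h_n, c_n))

# The query is the decoder's hidden state; the cell state is not used for scoring
query = dec_h[-1]  # [batch, hidden]
scores = torch.einsum("sbh,bh->sb", enc_out, query)  # [seq_len, batch]
weights = torch.softmax(scores, dim=0)  # normalized over source positions
```

The cell state still gets passed along to keep the LSTM recurrence going, it just stays out of the attention computation.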

For the third question, hopefully someone will be able to shed more light on this. However, one crude way to make the dot product work is to set dec_hidden_dim = 2*enc_hidden_dim.
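A quick sketch of that workaround, assuming GRUs and made-up sizes, just to show that the shapes line up once the decoder is twice as wide as each encoder direction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, emb_dim, enc_hidden = 6, 2, 8, 16  # illustrative sizes
dec_hidden = 2 * enc_hidden  # match the concatenated forward + backward width

encoder = nn.GRU(emb_dim, enc_hidden, bidirectional=True)
decoder = nn.GRU(emb_dim, dec_hidden)

enc_out, _ = encoder(torch.randn(seq_len, batch, emb_dim))  # [seq_len, batch, 2 * enc_hidden]
dec_out, _ = decoder(torch.randn(1, batch, emb_dim))        # [1, batch, dec_hidden]

# Dot score: broadcast the single decoder step over all source positions
scores = (dec_out * enc_out).sum(dim=2)   # [seq_len, batch]
weights = torch.softmax(scores, dim=0)    # attention over source positions
context = (weights.unsqueeze(2) * enc_out).sum(dim=0)  # [batch, 2 * enc_hidden]
```

The downside is that the decoder gets larger than it might otherwise need to be.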

Thank you very much. The answer is very clear. The PyTorch chatbot tutorial employs Luong attention, and it solves the dimension mismatch in the dot alignment function by summing the outputs of the two directions along the hidden dimension.
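For anyone landing here later, the summation trick looks roughly like this. It is a sketch in the spirit of the chatbot tutorial's encoder rather than a copy of it, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, emb_dim, hidden = 6, 2, 8, 16  # illustrative sizes

encoder = nn.GRU(emb_dim, hidden, bidirectional=True)
decoder = nn.GRU(emb_dim, hidden)

enc_out, _ = encoder(torch.randn(seq_len, batch, emb_dim))  # [seq_len, batch, 2 * hidden]

# Sum the forward half and the backward half so the width drops back to `hidden`
enc_out = enc_out[:, :, :hidden] + enc_out[:, :, hidden:]   # [seq_len, batch, hidden]

dec_out, _ = decoder(torch.randn(1, batch, emb_dim))        # [1, batch, hidden]

# Dot alignment now works without widening the decoder
scores = torch.sum(dec_out * enc_out, dim=2)  # [seq_len, batch]
weights = torch.softmax(scores, dim=0)
```

Compared with setting dec_hidden_dim = 2*enc_hidden_dim, this keeps the decoder small, at the cost of mixing the two directions into one vector.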