I am learning the attention mechanism by reading the official tutorial 'Translation with a Sequence to Sequence Network and Attention',

but I think there is something wrong with the implementation of the attention decoder:

We can see that the attention weights are calculated by the code below:

`attn_weights = F.softmax(self.attn(torch.cat((embedded[0], hidden[0]), 1)))`

However, the dimension of `hidden` is `(num_layers * num_directions, batch, hidden_size)`,

while the dimension of `embedded` is `(batch, seq_len, embedding_dim)` or `(seq_len, batch, embedding_dim)`.

Obviously `seq_len` is usually not equal to `num_layers * num_directions`, so is it reasonable to concatenate `hidden` and `embedded`?

I am a beginner in NLP and hope that someone who understands the tutorial can explain why they concatenate `hidden` and `embedded`. Thank you!

I am late, but for other beginners like myself I'll try to answer this.

Let’s see each component individually:

- Embedding matrix: The dimension of `embedded` is `(seq_len, batch, embedding_dim)`. An important thing to note in this tutorial is that we pass the input to the network one token at a time, i.e. the input will be "he", then "is", then "painting", and so on, rather than "he is painting" in one go. This means `seq_len` will always be `1`. So `embedded` will be of shape `(1, batch, embedding_dim)`, and hence `embedded[0]` will be of shape `(batch, embedding_dim)`.
- Hidden weights: The dimension of `hidden` is `(num_layers * num_directions, batch, hidden_size)`. For this tutorial `num_layers=1` and `num_directions=1`, so the dimension becomes `(1 * 1, batch, hidden_size)`. So `hidden[0]` has shape `(batch, hidden_size)`.
From the above, if we concatenate `embedded[0]` of shape `(batch, embedding_dim)` and `hidden[0]` of shape `(batch, hidden_size)` along `dim=1`, we safely get a tensor of size `(batch, hidden_size + embedding_dim)` for each time step of the input.
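Here is a minimal sketch of that shape arithmetic with made-up sizes (the values of `batch`, `embedding_dim`, `hidden_size`, and `max_length` below are just illustrative, not the tutorial's exact hyperparameters; in the tutorial `embedding_dim` equals `hidden_size`, so `self.attn` is roughly an `nn.Linear(hidden_size * 2, max_length)`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions, not taken verbatim from the tutorial)
batch = 4
embedding_dim = 256
hidden_size = 256
max_length = 10  # the fixed attention span the tutorial's decoder assumes

embedded = torch.randn(1, batch, embedding_dim)  # (seq_len=1, batch, embedding_dim)
hidden = torch.randn(1, batch, hidden_size)      # (num_layers * num_directions = 1, batch, hidden_size)

# Roughly what self.attn is in the tutorial's AttnDecoderRNN
attn = nn.Linear(embedding_dim + hidden_size, max_length)

# Drop the leading dimension of 1, then concatenate along dim=1
concat = torch.cat((embedded[0], hidden[0]), dim=1)  # (batch, embedding_dim + hidden_size)
attn_weights = F.softmax(attn(concat), dim=1)        # (batch, max_length)

print(concat.shape)        # torch.Size([4, 512])
print(attn_weights.shape)  # torch.Size([4, 10])
```

Because `seq_len` and `num_layers * num_directions` are both `1` here, indexing with `[0]` makes the two tensors line up on the batch dimension, which is why the concatenation is safe.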

Hopefully this helps!