# Question about the official tutorial ‘Translation with a Sequence to Sequence Network and Attention’

I am learning the attention mechanism by reading the official tutorial ‘Translation with a Sequence to Sequence Network and Attention’,
but I think there is something wrong with the implementation of the attention decoder:

The attention weights are calculated by the code below:
`attn_weights = F.softmax(self.attn(torch.cat((embedded[0], hidden[0]), 1)))`

However, the dimension of `hidden` is `(num_layers * num_directions, batch, hidden_size)`,
while the dimension of `embedded` is `(batch, seq_len, embedding_dim)` or `(seq_len, batch, embedding_dim)`.
`seq_len` is usually not equal to `num_layers * num_directions`, so is it reasonable to concatenate `hidden` and `embedded`?

I am a beginner in NLP and hope that someone who understands the tutorial can explain why they concatenate `hidden` and `embedded`. Thank you!

I am late, but for other beginners like myself I’ll try to answer this.
Let’s look at each component individually:

1. Embedded input: The dimension of `embedded` is `(seq_len, batch, embedding_dim)`. The important thing to note in this tutorial is that we pass the input to the network one token at a time, i.e. the input will be “he”, then “is”, then “painting”, and so on, rather than “he is painting” in one go. This means `seq_len` will always be `1`. So `embedded` will have shape `(1, batch, embedding_dim)`, and hence `embedded[0]` will have shape `(batch, embedding_dim)`.
2. Hidden state: The dimension of `hidden` is `(num_layers * num_directions, batch, hidden_size)`. For this tutorial `num_layers=1` and `num_directions=1`, so the dimension becomes `(1 * 1, batch, hidden_size)`. So `hidden[0]` has shape `(batch, hidden_size)`.

From the above, if we concatenate `embedded[0]` `(batch, embedding_dim)` and `hidden[0]` `(batch, hidden_size)` along `dim=1`, we safely get a tensor of size `(batch, hidden_size + embedding_dim)` for each time step of the input.
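
To check these shapes yourself, here is a minimal sketch. The sizes (`batch`, `embedding_dim`, `hidden_size`, `vocab_size`, `max_length`) and the standalone `embedding` and `attn` layers are illustrative assumptions and are not copied from the tutorial; only the concatenation line mirrors the one quoted in the question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions, not the tutorial's actual values)
batch, embedding_dim, hidden_size = 2, 6, 8
vocab_size, max_length = 20, 10

# Stand-ins for the decoder's embedding and attention layers
embedding = nn.Embedding(vocab_size, embedding_dim)
attn = nn.Linear(embedding_dim + hidden_size, max_length)

# One token per sequence at each decoding step, so seq_len == 1
token = torch.tensor([[3, 7]])                # (seq_len=1, batch)
embedded = embedding(token)                   # (1, batch, embedding_dim)

# Single-layer, unidirectional RNN: num_layers * num_directions == 1
hidden = torch.zeros(1, batch, hidden_size)   # (1, batch, hidden_size)

# Indexing with [0] drops the leading dimension of size 1 from both tensors,
# so they can be concatenated along the feature dimension (dim=1)
combined = torch.cat((embedded[0], hidden[0]), 1)   # (batch, embedding_dim + hidden_size)
attn_weights = F.softmax(attn(combined), dim=1)     # (batch, max_length)

print(embedded[0].shape, hidden[0].shape, combined.shape, attn_weights.shape)
```

Note that the concatenation never joins the `seq_len` axis with the `num_layers * num_directions` axis: both leading axes are removed by `[0]` first, and only the feature dimensions are joined.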

Hope this helps!