Problem with the official tutorial 'Translation with a Sequence to Sequence Network and Attention'

I am learning the attention mechanism by reading the official tutorial 'Translation with a Sequence to Sequence Network and Attention',
but I think there is something wrong with the implementation of the attention decoder:

We can see that the attention weights are calculated by the code below:
attn_weights = F.softmax(self.attn(torch.cat((embedded[0], hidden[0]), 1)))

However, the dimension of hidden is (num_layers * num_directions, batch, hidden_size),
while the dimension of embedded is (batch, seq_len, embedding_dim) or (seq_len, batch, embedding_dim).
Since seq_len is usually not equal to num_layers * num_directions, is it really reasonable to concatenate hidden and embedded?
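
For reference, here is a minimal sketch of the mismatch I have in mind (all the sizes below are made-up placeholder values, not the tutorial's):

import torch
import torch.nn as nn

# Placeholder sizes, just to illustrate the shapes I am talking about
batch, seq_len, embedding_dim, hidden_size = 4, 7, 16, 16
num_layers = 2  # e.g. a 2-layer unidirectional GRU

embedding = nn.Embedding(100, embedding_dim)
gru = nn.GRU(embedding_dim, hidden_size, num_layers=num_layers)

tokens = torch.randint(0, 100, (seq_len, batch))
embedded = embedding(tokens)   # (seq_len, batch, embedding_dim)
_, hidden = gru(embedded)      # (num_layers * num_directions, batch, hidden_size)

print(embedded.shape)  # torch.Size([7, 4, 16])
print(hidden.shape)    # torch.Size([2, 4, 16])
# The first dimensions (7 vs 2) do not match, which is what confuses me.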

I am a beginner in NLP and hope that someone who understands the tutorial can explain why they concatenate hidden and embedded. Thank you!

I am late, but for other beginners like myself I'll try to answer this.
Let's look at each component individually:

  1. Embedding matrix: the dimension of embedded is (seq_len, batch, embedding_dim). The important thing to note in this tutorial is that we pass the input to the network one token at a time, i.e. the input will be "he", then "is", then "painting", and so on, rather than "he is painting" in one go. This means seq_len is always 1, so embedded has shape (1, batch, embedding_dim) and hence embedded[0] has shape (batch, embedding_dim).
  2. Hidden state: the dimension of hidden is (num_layers * num_directions, batch, hidden_size). For this tutorial num_layers=1 and num_directions=1, so the dimension becomes (1 * 1, batch, hidden_size), and hidden[0] becomes (batch, hidden_size).

From the above, if we concatenate embedded[0] of shape (batch, embedding_dim) and hidden[0] of shape (batch, hidden_size) along dim=1, we safely get a tensor of size (batch, hidden_size + embedding_dim) for each time step of the input, as the shape check below sketches.
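
Here is a rough shape check under those assumptions (seq_len = 1, num_layers = num_directions = 1; the batch, hidden_size and max_length values are placeholders, and I am assuming embedding_dim equals hidden_size as in the tutorial's decoder):

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, hidden_size, max_length = 1, 256, 10
embedding_dim = hidden_size  # same size for both in this sketch

embedded = torch.randn(1, batch, embedding_dim)  # (1, batch, embedding_dim)
hidden = torch.randn(1, batch, hidden_size)      # (1 * 1, batch, hidden_size)

concat = torch.cat((embedded[0], hidden[0]), 1)  # (batch, embedding_dim + hidden_size)
print(concat.shape)                              # torch.Size([1, 512])

attn = nn.Linear(embedding_dim + hidden_size, max_length)
attn_weights = F.softmax(attn(concat), dim=1)    # (batch, max_length), one weight per encoder position
print(attn_weights.shape)                        # torch.Size([1, 10])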

Hope this helps!