Output of Bidirectional RNNs and Attention


Lately I’ve been working on a Seq2Seq architecture combined with an attention mechanism. I’m using a bidirectional GRU for both the encoder and the decoder. As far as I understand attention: I use the last hidden state of the decoder as the query (which has shape (2*num_layers, N, H_out)) and the encoder outputs as the keys (I think the encoder outputs are actually the hidden states at each time step t (h_t), with shape (N, sequence_length, 2*H_out)).

If I use 1 layer, the decoder hidden state has shape (2, N, H_out) → permute → (N, 2, H_out).

So to calculate the score between the query and the keys, I need to make the query and keys have the same feature size.
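For what it's worth, here is a minimal dot-product attention sketch of that step, assuming a 1-layer bidirectional decoder so that the flattened query's feature size (2*H) matches the keys; all shapes and names here are illustrative, not from any specific codebase:

```python
import torch

torch.manual_seed(0)
N, T, H = 3, 5, 4

keys = torch.randn(N, T, 2 * H)   # encoder outputs: (N, seq_len, 2*H_out)
h_n = torch.randn(2, N, H)        # decoder final hidden: (2 directions, N, H_out)

# Bring batch to the front, then flatten both directions into one 2H vector.
query = h_n.permute(1, 0, 2).reshape(N, 1, 2 * H)   # (N, 1, 2H)

scores = torch.bmm(query, keys.transpose(1, 2))     # (N, 1, T)
weights = torch.softmax(scores, dim=-1)             # attention weights over time
context = torch.bmm(weights, keys)                  # (N, 1, 2H) context vector
```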

I want to ask how the vectors are laid out in the hidden state. I mean: is it (left→right) then (right→left) of one layer, stacked with (left→right) then (right→left) of the next layer? Or is it (left→right)*n_layers stacked on top of (right→left)*n_layers?
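For reference, PyTorch documents the first layout: directions are adjacent within each layer, so h_n can be viewed as (num_layers, num_directions, N, H). A small sketch with arbitrary sizes can confirm this by comparing h_n against the corresponding slices of the output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, T, D_in, H, L = 3, 5, 4, 6, 2  # batch, seq_len, input size, hidden size, layers

gru = nn.GRU(D_in, H, num_layers=L, bidirectional=True, batch_first=True)
x = torch.randn(N, T, D_in)
out, h_n = gru(x)                  # out: (N, T, 2H), h_n: (2L, N, H)

# Separate layers and directions: index 0 = forward, 1 = backward per layer.
h = h_n.view(L, 2, N, H)

# Last layer, forward direction == output at the last time step, first H features.
assert torch.allclose(h[-1, 0], out[:, -1, :H])
# Last layer, backward direction == output at the first time step, last H features.
assert torch.allclose(h[-1, 1], out[:, 0, H:])
```

So the order along dim 0 of h_n is [layer0 fwd, layer0 bwd, layer1 fwd, layer1 bwd, …], not all forward layers followed by all backward layers.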

In the case where I use n_layer = 1, if I reshape the final hidden state of the decoder like b = b.reshape(batch_size, -1), how does that work? Does it concatenate (left→right) first, then (right→left)?
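One caveat worth checking here: calling reshape directly on the (2, N, H) tensor mixes batch elements, because the direction axis comes first in memory. Permuting the batch dimension to the front first does give the forward-then-backward concatenation per sample. A quick sketch:

```python
import torch

torch.manual_seed(0)
N, H = 3, 4
h_n = torch.randn(2, N, H)   # (2 directions, batch, hidden), n_layer = 1

# Correct: move batch to the front, then flatten the direction axis.
b = h_n.permute(1, 0, 2).reshape(N, -1)   # (N, 2H)

# For each sample this is [forward hidden, backward hidden] concatenated.
fwd, bwd = h_n[0], h_n[1]
assert torch.equal(b, torch.cat([fwd, bwd], dim=-1))

# By contrast, h_n.reshape(N, -1) flattens in [direction, batch, hidden]
# order and puts pieces of different batch elements into the same row.
```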

And finally: if the sequence length is 1 and n_layer is 1, are the output and h_n of the RNN the same, just with different shapes?
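This one is easy to verify empirically: with seq_len = 1 and one layer, the output of shape (N, 1, 2H) holds the same numbers as h_n of shape (2, N, H), just laid out differently. A minimal check (sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, D_in, H = 3, 5, 4

gru = nn.GRU(D_in, H, num_layers=1, bidirectional=True, batch_first=True)
x = torch.randn(N, 1, D_in)   # sequence length 1
out, h_n = gru(x)

assert out.shape == (N, 1, 2 * H) and h_n.shape == (2, N, H)

# Same values, different layout: forward half vs h_n[0], backward half vs h_n[1].
assert torch.allclose(out[:, 0, :H], h_n[0])
assert torch.allclose(out[:, 0, H:], h_n[1])
```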
Thank you.

Maybe this notebook might help. It covers machine translation training with an RNN-based encoder-decoder architecture using attention. The encoder can be configured to be bidirectional.

Let me know if you have any questions regarding the code.


Sorry for the late response. Thank you!