Output of Bidirectional RNNs and Attention


Lately I’ve been working on a Seq2Seq architecture combined with an attention mechanism. I’m using a bidirectional GRU for both the encoder and the decoder. As far as I understand attention: I use the last hidden state of the decoder as the query (which has shape (2*num_layers, N, H_out)) and the encoder outputs as the keys (I think the encoder outputs are actually the hidden states at each time step t (h_t), with shape (N, sequence_length, 2*H_out)).

If I use 1 layer, the decoder hidden state has shape (2, N, H_out) → permute → (N, 2, H_out).

So to calculate the score between the query and the keys, I need to make the query and keys have the same feature size.
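For what it's worth, here is a minimal dot-product attention sketch of that step, assuming a 1-layer bidirectional decoder so that the flattened query's feature size (2*H) matches the keys; all shapes and names here are illustrative, not from any specific codebase:

```python
import torch

torch.manual_seed(0)
N, T, H = 3, 5, 4

keys = torch.randn(N, T, 2 * H)   # encoder outputs: (N, seq_len, 2*H_out)
h_n = torch.randn(2, N, H)        # decoder final hidden: (2 directions, N, H_out)

# Bring batch to the front, then flatten both directions into one 2H vector.
query = h_n.permute(1, 0, 2).reshape(N, 1, 2 * H)   # (N, 1, 2H)

scores = torch.bmm(query, keys.transpose(1, 2))     # (N, 1, T)
weights = torch.softmax(scores, dim=-1)             # attention weights over time
context = torch.bmm(weights, keys)                  # (N, 1, 2H) context vector
```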

I want to ask how the vectors are laid out in the hidden state. I mean: is it (left→right) then (right→left) of one layer, stacked with (left→right) then (right→left) of the next layer? Or is it (left→right)*n_layers stacked on top of (right→left)*n_layers?
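For reference, PyTorch documents the first layout: directions are adjacent within each layer, so h_n can be viewed as (num_layers, num_directions, N, H). A small sketch with arbitrary sizes can confirm this by comparing h_n against the corresponding slices of the output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, T, D_in, H, L = 3, 5, 4, 6, 2  # batch, seq_len, input size, hidden size, layers

gru = nn.GRU(D_in, H, num_layers=L, bidirectional=True, batch_first=True)
x = torch.randn(N, T, D_in)
out, h_n = gru(x)                  # out: (N, T, 2H), h_n: (2L, N, H)

# Separate layers and directions: index 0 = forward, 1 = backward per layer.
h = h_n.view(L, 2, N, H)

# Last layer, forward direction == output at the last time step, first H features.
assert torch.allclose(h[-1, 0], out[:, -1, :H])
# Last layer, backward direction == output at the first time step, last H features.
assert torch.allclose(h[-1, 1], out[:, 0, H:])
```

So the order along dim 0 of h_n is [layer0 fwd, layer0 bwd, layer1 fwd, layer1 bwd, …], not all forward layers followed by all backward layers.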

In the case where I use n_layer = 1, if I reshape the final hidden state of the decoder like b = b.reshape(batch_size, -1), how does that work? Does it concatenate (left→right) first, then (right→left)?
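One caveat worth checking here: calling reshape directly on the (2, N, H) tensor mixes batch elements, because the direction axis comes first in memory. Permuting the batch dimension to the front first does give the forward-then-backward concatenation per sample. A quick sketch:

```python
import torch

torch.manual_seed(0)
N, H = 3, 4
h_n = torch.randn(2, N, H)   # (2 directions, batch, hidden), n_layer = 1

# Correct: move batch to the front, then flatten the direction axis.
b = h_n.permute(1, 0, 2).reshape(N, -1)   # (N, 2H)

# For each sample this is [forward hidden, backward hidden] concatenated.
fwd, bwd = h_n[0], h_n[1]
assert torch.equal(b, torch.cat([fwd, bwd], dim=-1))

# By contrast, h_n.reshape(N, -1) flattens in [direction, batch, hidden]
# order and puts pieces of different batch elements into the same row.
```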

And finally: if the sequence length is 1 and n_layer is 1, are the output and h_n of the RNN the same, just with different shapes?
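This one is easy to verify empirically: with seq_len = 1 and one layer, the output of shape (N, 1, 2H) holds the same numbers as h_n of shape (2, N, H), just laid out differently. A minimal check (sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, D_in, H = 3, 5, 4

gru = nn.GRU(D_in, H, num_layers=1, bidirectional=True, batch_first=True)
x = torch.randn(N, 1, D_in)   # sequence length 1
out, h_n = gru(x)

assert out.shape == (N, 1, 2 * H) and h_n.shape == (2, N, H)

# Same values, different layout: forward half vs h_n[0], backward half vs h_n[1].
assert torch.allclose(out[:, 0, :H], h_n[0])
assert torch.allclose(out[:, 0, H:], h_n[1])
```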
Thank you.

Maybe this notebook might help. It covers machine translation training with an RNN-based encoder-decoder architecture using attention. The encoder can be configured to be bidirectional.

Let me know if you have any questions regarding the code.


Sorry for the late response. Thank you!