I have a Seq2Seq model with an RNN encoder and decoder (either LSTM or GRU). Both encoder and decoder can have multiple layers, as long as both numbers are the same (so I can pass the hidden state of the encoder directly to the decoder).
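For reference, this is roughly the setup I mean (a minimal sketch with made-up sizes, assuming the GRU case and batch_first=True):

    import torch
    import torch.nn as nn

    # Rough sketch of the setup (illustrative sizes only).
    num_layers, hidden_dim, embedding_dim = 2, 128, 64
    batch_size, seq_len = 4, 10

    encoder_rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
    decoder_rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    src_embedded = torch.randn(batch_size, seq_len, embedding_dim)
    encoder_outputs, hidden = encoder_rnn(src_embedded)
    # encoder_outputs.shape = (batch_size, seq_len, hidden_dim)
    # hidden.shape          = (num_layers, batch_size, hidden_dim)

    # Because both use the same num_layers, the encoder hidden state can be fed
    # directly into the decoder as its initial hidden state.
    tgt_embedded = torch.randn(batch_size, 1, embedding_dim)  # one decoding step
    decoder_output, hidden = decoder_rnn(tgt_embedded, hidden)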
Now I want to add attention to that model; I’m looking at Luong attention right now. I understand the general concept as well as the existing examples I could find online. However, all the examples I’ve found so far use only one layer for (at least) the decoder.
For the decoder I have the following values for the calculation of the attention weights:
    encoder_outputs  # encoder_outputs.shape = (batch_size, seq_len, hidden_dim)
    hidden           # hidden.shape = (num_layers, batch_size, hidden_dim)
I now see two alternatives:
(1) I use only the last layer of the decoder hidden state, in line with all the examples. In this case, I get 1 attention weight for each token in the encoder output:
    hidden = hidden[-1].unsqueeze(2)                              # hidden.shape = (batch_size, hidden_dim, 1)
    attn_weights = torch.bmm(encoder_outputs, hidden).squeeze(2)  # attn_weights.shape = (batch_size, seq_len)
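For completeness, this is how I would finish that variant, i.e. a rough sketch of Luong "dot" scoring followed by the context vector (here hidden is still the unmodified decoder hidden state):

    import torch
    import torch.nn.functional as F

    # hidden has shape (num_layers, batch_size, hidden_dim); use only the last layer.
    score = torch.bmm(encoder_outputs, hidden[-1].unsqueeze(2)).squeeze(2)  # (batch_size, seq_len)
    attn_weights = F.softmax(score, dim=1)                                  # normalize over source tokens

    # Weighted sum of the encoder outputs -> one context vector per sequence.
    context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)         # (batch_size, 1, hidden_dim)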
(2) I consider all num_layers layers. In this case, I get num_layers attention weights for each token in the encoder output:
    hidden = hidden.permute(1, 2, 0)                   # hidden.shape = (batch_size, hidden_dim, num_layers)
    attn_weights = torch.bmm(encoder_outputs, hidden)  # attn_weights.shape = (batch_size, seq_len, num_layers)
At the end of the day, I only want 1 attention weight for each token in the encoder output. So I could take attn_weights[:,:,-1] (the weights of the last layer) or torch.sum(attn_weights, dim=2) (the sum of the weights over all layers) to get the required shape of (batch_size, seq_len), as sketched below.
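To make that concrete, this is the kind of reduction I have in mind (just a sketch; here the summed scores feed the softmax):

    import torch
    import torch.nn.functional as F

    # Option (2) end to end: scores from all layers, then reduce to (batch_size, seq_len).
    scores = torch.bmm(encoder_outputs, hidden.permute(1, 2, 0))     # (batch_size, seq_len, num_layers)

    # The two reductions I am considering:
    scores_last = scores[:, :, -1]                                   # keep only the last layer's scores
    scores_sum = scores.sum(dim=2)                                   # sum the scores across all layers

    attn_weights = F.softmax(scores_sum, dim=1)                      # (batch_size, seq_len)
    context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)  # (batch_size, 1, hidden_dim)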
Long story short, are there any valid arguments to favor one approach over the other? And if I do consider multiple layers, are the weights of the “non-last” layers more or less meaningful?