I have a Seq2Seq model with an RNN encoder and decoder (either LSTM or GRU). Both encoder and decoder can have multiple layers, as long as both numbers are the same (so I can pass the hidden state of the encoder directly to the decoder).
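For reference, this is roughly the setup I mean (a minimal sketch with made-up sizes, assuming the GRU case and batch_first=True):

    import torch
    import torch.nn as nn

    # Rough sketch of the setup (illustrative sizes only).
    num_layers, hidden_dim, embedding_dim = 2, 128, 64
    batch_size, seq_len = 4, 10

    encoder_rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
    decoder_rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    src_embedded = torch.randn(batch_size, seq_len, embedding_dim)
    encoder_outputs, hidden = encoder_rnn(src_embedded)
    # encoder_outputs.shape = (batch_size, seq_len, hidden_dim)
    # hidden.shape          = (num_layers, batch_size, hidden_dim)

    # Because both use the same num_layers, the encoder hidden state can be fed
    # directly into the decoder as its initial hidden state.
    tgt_embedded = torch.randn(batch_size, 1, embedding_dim)  # one decoding step
    decoder_output, hidden = decoder_rnn(tgt_embedded, hidden)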
Now I want to add attention to that model; I’m looking at Luong attention right now. I understand the general concept as well as the existing examples I could find online. However, all the examples I’ve found so far use only one layer for (at least) the decoder.
For the decoder I have the following values for the calculation of the attention weights:
    encoder_outputs  # encoder_outputs.shape = (batch_size, seq_len, hidden_dim)
    hidden           # hidden.shape = (num_layers, batch_size, hidden_dim)
I now see two alternatives:
(1) I use only the last layer of the decoder hidden state, in line with all the examples. In this case, I get 1 attention weight for each token in the encoder output:
    hidden = hidden[-1].unsqueeze(2)                              # hidden.shape = (batch_size, hidden_dim, 1)
    attn_weights = torch.bmm(encoder_outputs, hidden).squeeze(2)  # attn_weights.shape = (batch_size, seq_len)
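For completeness, this is how I would finish that variant, i.e. a rough sketch of Luong "dot" scoring followed by the context vector (here hidden is still the unmodified decoder hidden state):

    import torch
    import torch.nn.functional as F

    # hidden has shape (num_layers, batch_size, hidden_dim); use only the last layer.
    score = torch.bmm(encoder_outputs, hidden[-1].unsqueeze(2)).squeeze(2)  # (batch_size, seq_len)
    attn_weights = F.softmax(score, dim=1)                                  # normalize over source tokens

    # Weighted sum of the encoder outputs -> one context vector per sequence.
    context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)         # (batch_size, 1, hidden_dim)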
(2) I consider all num_layers layers. In this case, I get num_layers attention weights for each token in the encoder output:
    hidden = hidden.permute(1, 2, 0)                   # hidden.shape = (batch_size, hidden_dim, num_layers)
    attn_weights = torch.bmm(encoder_outputs, hidden)  # attn_weights.shape = (batch_size, seq_len, num_layers)
At the end of the day, I only want 1 attention weight for each token in the encoder output. So I could take attn_weights[:,:,-1] (the weights of the last layer) or torch.sum(attn_weights, dim=2) (the sum of the weights over all layers) to get the required shape of (batch_size, seq_len), as sketched below.
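To make that concrete, this is the kind of reduction I have in mind (just a sketch; here the summed scores feed the softmax):

    import torch
    import torch.nn.functional as F

    # Option (2) end to end: scores from all layers, then reduce to (batch_size, seq_len).
    scores = torch.bmm(encoder_outputs, hidden.permute(1, 2, 0))     # (batch_size, seq_len, num_layers)

    # The two reductions I am considering:
    scores_last = scores[:, :, -1]                                   # keep only the last layer's scores
    scores_sum = scores.sum(dim=2)                                   # sum the scores across all layers

    attn_weights = F.softmax(scores_sum, dim=1)                      # (batch_size, seq_len)
    context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)  # (batch_size, 1, hidden_dim)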
Long story short, are there any valid arguments to favor one approach over the other? And if I do consider multiple layers, are the weights of the “non-last” layers more or less meaningful?