Attention for RNN Decoder with multiple layers

I have a Seq2Seq model using an RNN encoder and decoder (either LSTM or GRU). Both the encoder and the decoder can have multiple layers, as long as both numbers are the same (so I can pass the encoder's hidden state directly to the decoder).

Now I want to add attention to that model; I'm looking at Luong attention right now. I understand the general concept as well as the existing examples I can find online. However, all the examples I've found so far use only one layer for (at least) the decoder.

For the decoder, I have the following values for the calculation of the attention weights:

encoder_outputs  # encoder_outputs.shape = (batch_size, seq_len, hidden_dim)
hidden           # hidden.shape = (num_layers, batch_size, hidden_dim)
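
For context, here is a minimal sketch of where these two tensors come from, assuming a GRU encoder with batch_first=True (all sizes are made-up example values). The decoder's hidden state keeps the same shape as the encoder's at every step:

import torch
import torch.nn as nn

# Made-up example sizes
batch_size, seq_len, input_dim, hidden_dim, num_layers = 32, 20, 64, 128, 2

encoder = nn.GRU(input_dim, hidden_dim, num_layers=num_layers, batch_first=True)
src = torch.randn(batch_size, seq_len, input_dim)  # dummy source batch

encoder_outputs, hidden = encoder(src)
# encoder_outputs.shape = (batch_size, seq_len, hidden_dim)     -> last layer, all time steps
# hidden.shape          = (num_layers, batch_size, hidden_dim)  -> all layers, last time step;
# this also initializes the decoder, whose hidden state has the same shape at every decoding step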

I now see two alternatives:

(1) I use only the last layer of the decoder, in line with all the examples. In this case, I get one attention weight for each token in the encoder output:

hidden = hidden[-1].unsqueeze(2) # hidden.shape = (batch_size, hidden_dim, 1)
attn_weights = torch.bmm(encoder_outputs, hidden).squeeze(2)
# attn_weights.shape = (batch_size, seq_len)

(2) I consider all num_layers layers. In this case, I get num_layers attention weights for each token in the encoder output:

hidden = hidden.permute(1,2,0) # hidden.shape = (batch_size, hidden_dim, num_layers)
attn_weights = torch.bmm(encoder_outputs, hidden)
# attn_weights.shape = (batch_size, seq_len, num_layers)

At the end of the day, I only want one attention weight for each token in the encoder output. So I could either take attn_weights[:,:,-1] (the weights of the last layer) or torch.sum(attn_weights, dim=2) (the sum of the weights over all layers) to get the required shape of (batch_size, seq_len).
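
To make that concrete, here is a minimal sketch of both reductions side by side (using the tensors from above; the variable names are just for illustration). Strictly speaking, in Luong attention these dot products are scores that only become weights after a softmax, so I've included that step together with the context vector:

# Option (1): score against the last decoder layer only
scores_last = torch.bmm(encoder_outputs, hidden[-1].unsqueeze(2)).squeeze(2)  # (batch_size, seq_len)

# Option (2): score against all layers, then reduce over the layer dimension
scores_all = torch.bmm(encoder_outputs, hidden.permute(1, 2, 0))  # (batch_size, seq_len, num_layers)
scores_reduced = scores_all[:, :, -1]   # take the last layer ...
scores_reduced = scores_all.sum(dim=2)  # ... or sum over all layers

# Either way: normalize and build the context vector
attn_weights = torch.softmax(scores_reduced, dim=1)              # (batch_size, seq_len)
context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)  # (batch_size, 1, hidden_dim)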

Long story short: are there any valid arguments to favor one approach over the other? And if I do consider multiple layers, are the weights of the "non-last" layers more or less meaningful?