I have a Seq2Seq model using an RNN encoder and decoder (either LSTM or GRU). Both encoder and decoder can have multiple layers, as long as both numbers are the same (so I can pass the hidden state of the encoder directly to the decoder).

Now I want to add attention to that model; I’m looking at Luong Attention right now. I understand the general concept, as well as the existing examples I’ve found online. However, all the examples I’ve found so far use only one layer for (at least) the decoder.

For the decoder, I have the following values for calculating the attention weights:

```
encoder_outputs # encoder_outputs.shape = (batch_size, seq_len, hidden_dim)
hidden # hidden.shape = (num_layers, batch_size, hidden_dim)
```

I now see two alternatives:

(1) I use only the last layer of the decoder, in line with all the examples. In this case, I get one attention weight for each token in the encoder output:

```
hidden = hidden[-1].unsqueeze(2) # hidden.shape = (batch_size, hidden_dim, 1)
attn_weights = torch.bmm(encoder_outputs, hidden).squeeze(2)
# attn_weights.shape = (batch_size, seq_len)
```
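As a sanity check for option (1), here's a minimal runnable sketch with dummy tensors (the concrete sizes are arbitrary, just for illustration):

```python
import torch

batch_size, seq_len, hidden_dim, num_layers = 4, 10, 32, 3

# Dummy encoder outputs and decoder hidden state
encoder_outputs = torch.randn(batch_size, seq_len, hidden_dim)
hidden = torch.randn(num_layers, batch_size, hidden_dim)

# Option (1): score against only the last layer's hidden state
last_hidden = hidden[-1].unsqueeze(2)                         # (batch_size, hidden_dim, 1)
attn_weights = torch.bmm(encoder_outputs, last_hidden).squeeze(2)
print(attn_weights.shape)  # torch.Size([4, 10])
```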

(2) I consider all `num_layers` layers. In this case, I get `num_layers` attention weights for each token in the encoder output:

```
hidden = hidden.permute(1,2,0) # hidden.shape = (batch_size, hidden_dim, num_layers)
attn_weights = torch.bmm(encoder_outputs, hidden)
# attn_weights.shape = (batch_size, seq_len, num_layers)
```
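And the corresponding sketch for option (2), using the same dummy tensors:

```python
import torch

batch_size, seq_len, hidden_dim, num_layers = 4, 10, 32, 3

# Dummy encoder outputs and decoder hidden state
encoder_outputs = torch.randn(batch_size, seq_len, hidden_dim)
hidden = torch.randn(num_layers, batch_size, hidden_dim)

# Option (2): score against the hidden states of all layers at once
hidden_all = hidden.permute(1, 2, 0)                          # (batch_size, hidden_dim, num_layers)
attn_weights = torch.bmm(encoder_outputs, hidden_all)
print(attn_weights.shape)  # torch.Size([4, 10, 3])
```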

At the end of the day, I only want one attention weight for each token in the encoder output. So I could do `attn_weights[:,:,-1]` (take the weights of the last layer) or `torch.sum(attn_weights, dim=2)` (sum the weights over all layers) to get the required shape of `(batch_size, seq_len)`.
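One thing worth noting when comparing the two reductions: because `torch.bmm` is linear, summing the per-layer scores from option (2) is mathematically the same as running option (1) on the *sum* of the layer hidden states (`hidden.sum(dim=0)` instead of `hidden[-1]`). A quick check (again with arbitrary dummy sizes):

```python
import torch

batch_size, seq_len, hidden_dim, num_layers = 4, 10, 32, 3
encoder_outputs = torch.randn(batch_size, seq_len, hidden_dim)
hidden = torch.randn(num_layers, batch_size, hidden_dim)

# Option (2), then sum the scores over the layer dimension
per_layer = torch.bmm(encoder_outputs, hidden.permute(1, 2, 0))  # (batch, seq, layers)
summed = per_layer.sum(dim=2)                                    # (batch, seq)

# Option (1)-style scoring against the sum of all layer hidden states
summed_hidden = hidden.sum(dim=0).unsqueeze(2)                   # (batch, hidden, 1)
direct = torch.bmm(encoder_outputs, summed_hidden).squeeze(2)    # (batch, seq)

print(torch.allclose(summed, direct, atol=1e-5))  # True
```

So the sum-over-layers variant is still a single dot-product attention, just against a different summary vector of the decoder state.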

Long story short, are there some valid arguments to favor one approach over the other? In case of considering multiple layers, are the weights of the “non-last” layers more or less meaningful?