Attention for RNN Decoder with multiple layers

I have a Seq2Seq model using an RNN encoder and decoder (either LSTM or GRU). Both encoder and decoder can have multiple layers, as long as both numbers are the same (so I can directly pass the encoder's hidden state to the decoder).

Now I want to add attention to that model; I'm looking at Luong attention right now. I understand the general concept as well as the existing examples I've found online. However, all the examples I've found so far use only one layer for (at least) the decoder.

In the decoder, I have the following values for computing the attention weights:

encoder_outputs  # encoder_outputs.shape = (batch_size, seq_len, hidden_dim)
hidden           # hidden.shape = (num_layers, batch_size, hidden_dim)
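For concreteness, here's a minimal sketch (with made-up sizes and a plain unidirectional GRU; none of this comes from my actual model) of where tensors with these shapes come from. Since encoder and decoder share num_layers and hidden_dim, the decoder's hidden state has the same shape as the encoder's:

import torch
import torch.nn as nn

batch_size, seq_len, input_dim, hidden_dim, num_layers = 4, 7, 16, 32, 3

# unidirectional GRU encoder; batch_first=True so outputs are (batch, seq, hidden)
encoder = nn.GRU(input_dim, hidden_dim, num_layers=num_layers, batch_first=True)

src = torch.randn(batch_size, seq_len, input_dim)
encoder_outputs, hidden = encoder(src)

print(encoder_outputs.shape)  # (batch_size, seq_len, hidden_dim) = (4, 7, 32)
print(hidden.shape)           # (num_layers, batch_size, hidden_dim) = (3, 4, 32)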

I now see two alternatives:

(1) I use only the last layer of the decoder, in line with all the examples. In this case, I get one attention weight for each token in the encoder output:

hidden = hidden[-1].unsqueeze(2) # hidden.shape = (batch_size, hidden_dim, 1)
attn_weights = torch.bmm(encoder_outputs, hidden).squeeze(2)
# attn_weights.shape = (batch_size, seq_len)

(2) I consider all num_layers layers. In this case, I get num_layers attention weights for each token in the encoder output:

hidden = hidden.permute(1,2,0) # hidden.shape = (batch_size, hidden_dim, num_layers)
attn_weights = torch.bmm(encoder_outputs, hidden)
# attn_weights.shape = (batch_size, seq_len, num_layers)

At the end of the day, I only want one attention weight for each token in the encoder output. So I could either take attn_weights[:,:,-1] (the weights of the last layer) or torch.sum(attn_weights, dim=2) (the sum of the weights over all layers) to get the required shape of (batch_size, seq_len).
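For reference, here's a rough sketch of how either choice would plug into the rest of a Luong-style dot attention step (softmax over the scores, then the context vector). The tensor names follow the shapes above; the function name and the softmax/bmm details are just my assumptions for illustration:

import torch
import torch.nn.functional as F

def luong_dot_attention(encoder_outputs, hidden, use_all_layers=False, reduce="last"):
    # encoder_outputs: (batch_size, seq_len, hidden_dim)
    # hidden:          (num_layers, batch_size, hidden_dim)
    if not use_all_layers:
        # approach (1): score against the top decoder layer only
        scores = torch.bmm(encoder_outputs, hidden[-1].unsqueeze(2)).squeeze(2)
    else:
        # approach (2): score against every layer, then reduce to (batch_size, seq_len)
        scores = torch.bmm(encoder_outputs, hidden.permute(1, 2, 0))
        scores = scores[:, :, -1] if reduce == "last" else scores.sum(dim=2)

    attn_weights = F.softmax(scores, dim=1)  # (batch_size, seq_len)
    # weighted sum of the encoder outputs -> context vector of shape (batch_size, hidden_dim)
    context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return attn_weights, context

Either branch ends up with one weight per source token, so the rest of the decoder step is identical for both approaches.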

Long story short: are there valid arguments that favor one approach over the other? If I consider multiple layers, are the weights of the “non-last” layers more or less meaningful?


I have the same question. I tried both empirically, but the results didn't show much of a difference. Intuitively, I think your first approach makes more sense, because summing with torch.sum makes many of the output values the same.

I think I’ve found the code here: torch/nn/modules/rnn.py, specifically these lines:

for layer in range(num_layers):
    for direction in range(num_directions):
        suffix = '_reverse' if direction == 1 else ''
        weights = ['weight_ih_l{}{}', 'weight_hh_l{}{}', 'bias_ih_l{}{}',
                   'bias_hh_l{}{}', 'weight_hr_l{}{}']
        weights = [x.format(layer, suffix) for x in weights]
        ...
        ...
        ...
        self._all_weights += [weights[:4]] 
        # (ih, ih_reverse) for every layer ==> Last layer: [index=-1] or last 2 [index=-2:]

Since the weights are stored one after the other (per layer and per direction), I'd assume it's safe to say the stacked hidden states, and hence the per-layer attention weights in approach (2), keep the same order: [h1, h1_b, h1_reverse, h1_b_reverse, …] * num_layers, with the last layer at the end.
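If it helps, the layer/direction layout of the returned hidden state can be checked directly; the docs mention it can be viewed as (num_layers, num_directions, batch, hidden_size) to separate layers and directions. A small sketch with made-up sizes (not code from this thread):

import torch
import torch.nn as nn

num_layers, num_directions, batch_size, hidden_dim = 3, 2, 4, 32

rnn = nn.GRU(16, hidden_dim, num_layers=num_layers, bidirectional=True, batch_first=True)
_, h_n = rnn(torch.randn(batch_size, 7, 16))

print(h_n.shape)  # (num_layers * num_directions, batch_size, hidden_dim) = (6, 4, 32)

# separate layers and directions as suggested in the docs
h_n_view = h_n.view(num_layers, num_directions, batch_size, hidden_dim)

# the last two slices of h_n are the last layer's forward and backward states
assert torch.equal(h_n_view[-1, 0], h_n[-2])
assert torch.equal(h_n_view[-1, 1], h_n[-1])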
