AttentionDecoderRNN without MAX_LENGTH

From the PyTorch Seq2Seq tutorial, http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder

We see that the attention mechanism relies heavily on the MAX_LENGTH parameter to determine the output dimensions of the attn -> softmax -> attn_weights chain, i.e.

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

More specifically:

self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
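For context, the forward pass on that tutorial page uses this layer roughly as follows. The snippet below is a self-contained re-creation with toy sizes and random stand-in tensors, not a verbatim quote, so the exact names and shapes are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, MAX_LENGTH = 256, 10                        # toy sizes (assumed)
attn = nn.Linear(hidden_size * 2, MAX_LENGTH)

embedded = torch.randn(1, 1, hidden_size)                # stand-in for the embedded decoder input
hidden = torch.randn(1, 1, hidden_size)                  # stand-in for the previous decoder hidden state
encoder_outputs = torch.randn(MAX_LENGTH, hidden_size)   # encoder states padded to MAX_LENGTH

# attn maps the 2*hidden_size concat to one score per *position*,
# so attn_weights is forced to have exactly MAX_LENGTH entries
attn_weights = F.softmax(attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)       # (1, MAX_LENGTH)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))   # (1, 1, hidden_size)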

I understand that the MAX_LENGTH variable is the mechanism for reducing the number of parameters that need to be trained in the AttnDecoderRNN.
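To make that concrete, here is a quick comparison of parameter counts; the vocabulary size used here is just an assumed example, not the tutorial's actual number:

import torch.nn as nn

hidden_size, MAX_LENGTH, output_size = 256, 10, 4345     # output_size is an assumed example vocab size

fixed_attn = nn.Linear(hidden_size * 2, MAX_LENGTH)      # what the tutorial builds
vocab_attn = nn.Linear(hidden_size * 2, output_size)     # hypothetical "attend over the vocabulary" layer

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(fixed_attn))   # (2*256 + 1) * 10   =     5,130
print(n_params(vocab_attn))   # (2*256 + 1) * 4345 = 2,228,985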

If we don't have a pre-determined MAX_LENGTH, what value should we use for the attn layer's output dimension?

Would it be output_size? If so, then we would be learning attention over the full vocabulary of the target language. Isn't that the real intent of the Bahdanau et al. (2015) attention paper?

My understanding is that MAX_LENGTH is used to initialize the size of the attention matrix. You make it as long as the longest sequence you have so that any sequence fits. The padded encoder-output buffer is max_len long and the attention weights at each step also have length max_len, so if your input is shorter than that, the extra slots simply hold 0's and get ignored. You need a hard number because it's part of the architecture, kind of like how you use hard numbers in CNNs. Does that help?
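If it helps, here is a tiny self-contained sketch of that padding behaviour, with toy sizes and random tensors standing in for real encoder states:

import torch

MAX_LENGTH, hidden_size, input_length = 10, 256, 4       # toy sizes (assumed)

# a MAX_LENGTH-long buffer is pre-allocated and only the real positions are filled
encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
encoder_outputs[:input_length] = torch.randn(input_length, hidden_size)

attn_weights = torch.softmax(torch.randn(1, MAX_LENGTH), dim=1)   # fixed-length attention vector

# the all-zero padding rows add nothing to the context vector...
full = attn_weights @ encoder_outputs
# ...so it matches attending over the real positions only (the softmax can still
# waste some probability mass on the padded slots, but those weights multiply zeros)
short = attn_weights[:, :input_length] @ encoder_outputs[:input_length]
print(torch.allclose(full, short))   # True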

The tutorial is ridiculously stupid. Ignore it. Read this:

But the point is that their attention layer is a fixed matrix, mapping from 2h to max_len as you noted. Note that the attention weights are computed as follows:

exp( target_embedding[t] * self.attn )   # shape: (1, self.max_length)

so the attention does not depend on the actual representation of the source (it is NOT using the encoded source vectors at all). It's just an attention based on position, which is stupid and makes no sense. It would have taken no effort to use the encoder outputs as input, say something like:

exp( target_embedding[t] * encoder_outputs )

or

attn_weights = F.softmax(embedded[0] @ encoder_outputs.t(), dim=1)   # (1, src_len)

something like that.
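A minimal self-contained sketch of that idea, with assumed toy shapes (one encoder vector per actual source token, scored against the decoder's current embedding):

import torch
import torch.nn.functional as F

hidden_size, src_len = 256, 7                          # toy sizes (assumed)
embedded = torch.randn(1, 1, hidden_size)              # decoder input embedding, as in the tutorial
encoder_outputs = torch.randn(src_len, hidden_size)    # one vector per actual source token

scores = embedded[0] @ encoder_outputs.t()             # (1, src_len): content-based, not positional
attn_weights = F.softmax(scores, dim=1)                # normalize over source positions
context = attn_weights @ encoder_outputs               # (1, hidden_size) weighted sum of encoder states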

Do not initialize it to a fixed length. Don't have a fixed matrix for self.attn; in fact, self.attn should NOT exist at all. Remove that line of code and use the encoder outputs as the input instead. I'd recommend Luong's attention, so something like:

alpha_{s,t} = align(d_t, e_s) ∝ exp(<d_t, e_s>)

so have the encoder and decoder states determine the weights. The whole point of attention is that the actual semantics of the encoding vectors and the target vector determine the output of the RNN. Plus, that is exactly what lets attention handle variable-length inputs. It's ridiculous how the tutorial throws away all the advantages of RNNs in one go.
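Here is a hedged sketch of that Luong-style dot-product attention as a standalone module (LuongDotAttention is a made-up helper name, not something from the tutorial); it works for any source length, so no MAX_LENGTH is needed:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongDotAttention(nn.Module):
    # alpha_{s,t} proportional to exp(<d_t, e_s>): the score is a dot product, no learned fixed matrix
    def forward(self, d_t, encoder_outputs):
        # d_t: (1, hidden_size) current decoder state; encoder_outputs: (src_len, hidden_size)
        scores = d_t @ encoder_outputs.t()        # (1, src_len)  <d_t, e_s> for every source position
        alpha = F.softmax(scores, dim=1)          # normalize over the source positions
        context = alpha @ encoder_outputs         # (1, hidden_size) weighted sum of the e_s
        return context, alpha

attn = LuongDotAttention()
context, alpha = attn(torch.randn(1, 256), torch.randn(13, 256))   # any source length works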

For the sake of a "quick" tutorial, one can STILL keep the speed, use "max_len" as the maximum sequence length for padding, and at the same time stay variable length and use the encoder's outputs (see the sketch below). Using a fixed matrix is just plain lazy and confusing. max_len isn't the problem; the fixed matrix is.
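For example, one could keep a padded max_len buffer for batching speed and simply mask the padding, so the attention stays content-based and effectively variable length. This is only a sketch with assumed shapes, not code from the tutorial:

import torch
import torch.nn.functional as F

batch, max_len, hidden_size = 3, 10, 256                  # toy sizes (assumed)
lengths = torch.tensor([4, 10, 7])                        # true source lengths per example

decoder_state = torch.randn(batch, hidden_size)
encoder_outputs = torch.randn(batch, max_len, hidden_size)    # padded to max_len for fast batching

# content-based score for every (example, source position)
scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)   # (batch, max_len)

# mask the padded positions so each example is effectively variable length
mask = torch.arange(max_len).unsqueeze(0) >= lengths.unsqueeze(1)            # True where padding
scores = scores.masked_fill(mask, float('-inf'))

attn_weights = F.softmax(scores, dim=1)                                      # 0 on padded slots
context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)   # (batch, hidden_size)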
