We see that the attention mechanism relies heavily on the MAX_LENGTH parameter to determine the output dimensions along the attn -> attn_softmax -> attn_weights chain, i.e.
I understand that the MAX_LENGTH variable is the mechanism to reduce the number of parameters that need to be trained in the AttentionDecoderRNN.
If we don’t have a pre-determined MAX_LENGTH, what values should we initialize the attn layer with?
Would it be the output_size? If so, we’d be learning attention with respect to the full vocabulary of the target language. Is that really the intention of the Bahdanau (2015) attention paper?
My understanding is that MAX_LENGTH is used to initialize the size of the attention matrix. You make it as long as the longest sequence you have so that you have room to put any sequence in. So you have a max_len x max_len matrix, and if your input is shorter than that, it’ll simply have 0’s in the extra space, which get ignored. You need a hard number because it’s part of the architecture, kinda like how you use hard numbers in CNNs. Does that help?
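To make the complaint below concrete, here’s a minimal sketch of what that fixed-size layer looks like (resembling, but not copied from, the tutorial’s AttnDecoderRNN; the sizes are made up for illustration). Note the attention scores are projected straight to MAX_LENGTH positions without ever looking at the encoder states:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration (not from the thread)
MAX_LENGTH = 10
hidden_size = 8

# Tutorial-style layer: attention weights are computed from the decoder's
# embedded input and hidden state alone, then projected to a fixed
# MAX_LENGTH -- one score per *position*, not per encoder state.
attn = nn.Linear(hidden_size * 2, MAX_LENGTH)

embedded = torch.randn(1, hidden_size)   # decoder input embedding
hidden = torch.randn(1, hidden_size)     # decoder hidden state
attn_weights = F.softmax(attn(torch.cat((embedded, hidden), dim=1)), dim=1)

# encoder_outputs is padded with zeros up to MAX_LENGTH, so weights that
# land on padding positions multiply zero vectors and contribute nothing.
encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
seq_len = 6                              # actual source length < MAX_LENGTH
encoder_outputs[:seq_len] = torch.randn(seq_len, hidden_size)

context = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
print(attn_weights.shape, context.shape)  # torch.Size([1, 10]) torch.Size([1, 1, 8])
```

The zero-padding means out-of-range positions contribute nothing to the context vector, but the weights themselves still depend only on the decoder side, which is what the next reply objects to.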
So the attention doesn’t depend on the actual representation of the source at all (it is NOT using the embeddings of the encoding source vectors). It’s just attention based on position, which is stupid and makes no sense. It would take no effort to use the encoder outputs as input. Say something like:
Do not initialize it to a fixed length. Don’t have a fixed matrix for self.attn. In fact, self.attn should NOT exist at all really. Remove that line of code and use the encoder outputs as the input. I’d recommend Luong’s attention, so something like:
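A minimal sketch of what that looks like, using the “general” score from Luong (2015) (class and variable names here are my own, not the tutorial’s):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongAttention(nn.Module):
    """Content-based attention sketch (Luong 2015, 'general' score).

    No MAX_LENGTH anywhere: weights are scored against the actual
    encoder states, so any source length works.
    """
    def __init__(self, hidden_size):
        super().__init__()
        # 'general' score: score(h_t, h_s) = h_t^T W_a h_s
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, hidden)
        # encoder_outputs: (batch, src_len, hidden)
        scores = torch.bmm(self.W_a(encoder_outputs),
                           decoder_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = F.softmax(scores, dim=1)
        # Context vector: weighted sum of the encoder states themselves.
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights

# Works for any source length -- no fixed attention matrix.
attn = LuongAttention(hidden_size=8)
for src_len in (3, 11, 30):
    ctx, w = attn(torch.randn(2, 8), torch.randn(2, src_len, 8))
    print(ctx.shape, w.shape)
```

Because the scores come from the encoder states, the same module handles a 3-token and a 30-token source without retraining or padding tricks.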
So have the encoder and decoder states determine the weights. The whole point of attention is that the actual semantics of the encoding vector and the target vector determine the output of the RNN. Plus, that is exactly what makes attention variable length. It’s ridiculous how the tutorial destroys all the advantages of RNNs in one go.
For the sake of the “speed” tutorial, one can STILL keep the quick speed, use “max_len” as the maximum sequence length, and at the same time be variable length and use the embeddings of the encoder. Using a fixed matrix is just plain lazy and confusing. The max_len isn’t the problem; the fact they used a fixed matrix is.
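One way to have both, sketched under my own naming (this is a masking pattern, not the tutorial’s code): keep the fixed max_len buffer for fast batching, but score against the encoder states and mask the padding before the softmax, so the attention stays content-based and effectively variable length.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes, not from the thread.
max_len, hidden_size = 10, 8

# Fixed-size buffer for batching speed; only the first seq_len rows
# hold real encoder states, the rest are zero padding.
encoder_outputs = torch.zeros(1, max_len, hidden_size)
seq_len = 4
encoder_outputs[0, :seq_len] = torch.randn(seq_len, hidden_size)

decoder_hidden = torch.randn(1, hidden_size)

# Dot-product scores against the actual encoder states (content-based).
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)

# Mask positions beyond the true length before the softmax.
mask = torch.arange(max_len).unsqueeze(0) >= seq_len
scores = scores.masked_fill(mask, float('-inf'))
weights = F.softmax(scores, dim=1)

print(weights[0, seq_len:])  # padding positions get exactly zero weight
```

The -inf entries become exact zeros after the softmax, so the padded tail can never attract attention, while the buffer shape stays fixed at max_len.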