Why do attention models need to choose a maximum sentence length?

I was going through the seq2seq-translation tutorial in PyTorch and found the following passage:

Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length (input length, for encoder outputs) that it can apply to. Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.
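For reference, my understanding is that the layer being described looks roughly like this (a sketch based on my reading of the tutorial's AttnDecoderRNN; exact names and shapes may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10   # fixed cap chosen up front
hidden_size = 256

# The attention layer produces a *fixed* number of scores, one per encoder
# position up to MAX_LENGTH, from the current input embedding + hidden state.
attn = nn.Linear(hidden_size * 2, MAX_LENGTH)

embedded = torch.randn(1, hidden_size)                   # decoder input embedding
hidden = torch.randn(1, hidden_size)                     # decoder hidden state
encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)   # padded out to MAX_LENGTH

attn_weights = F.softmax(attn(torch.cat((embedded, hidden), dim=1)), dim=1)   # (1, MAX_LENGTH)
context = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))  # (1, 1, hidden_size)
```

So the number of attention weights is baked into the shape of the `nn.Linear`, and shorter sentences just put (near-zero) weight on the padded positions.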

The quoted passage didn't really make sense to me. My understanding (following the Pointer Network paper) is that attention at time step $t$ is computed as:

$$ u^{<t,j>} = v^\top \tanh( W_1 e_j + W_2 d_t ) = NN_u(e_j, d_t) $$
$$ \alpha^{<t,j>} = \mathrm{softmax}( u^{<t,j>} ) = \frac{\exp(u^{<t,j>})}{Z_t}, \qquad Z_t = \sum_{k=1}^{T_x} \exp(u^{<t,k>}) $$
$$ d'_{t+1} = \sum_{j=1}^{T_x} \alpha^{<t,j>} e_j $$

which basically means that a given attention weight does not depend on the length of the encoder sequence: $T_x$ can vary from sentence to sentence and the equations above still apply, because the softmax is simply taken over however many encoder outputs there are.
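In code, I would expect something like the following to work for any input length (my own sketch of the equations above, not taken from the tutorial):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Pointer-Network-style attention: u_j = v^T tanh(W1 e_j + W2 d_t)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W1 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W2 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, encoder_outputs, decoder_state):
        # encoder_outputs: (T_x, hidden_size) -- T_x can be anything
        # decoder_state:   (hidden_size,)
        scores = self.v(torch.tanh(self.W1(encoder_outputs) + self.W2(decoder_state))).squeeze(-1)  # (T_x,)
        alpha = torch.softmax(scores, dim=0)                       # (T_x,)
        context = (alpha.unsqueeze(-1) * encoder_outputs).sum(0)   # (hidden_size,)
        return context, alpha

# works for any input length:
attn = AdditiveAttention(hidden_size=256)
for T_x in (5, 12, 37):
    ctx, alpha = attn(torch.randn(T_x, 256), torch.randn(256))
    print(T_x, alpha.shape)  # alpha has exactly T_x weights
```

Nothing here ever needs a maximum length, because the parameters ($W_1$, $W_2$, $v$) are applied per encoder position.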

If that is true, then why does the tutorial require a maximum sentence length?

They also say:

There are other forms of attention that work around the length limitation by using a relative position approach. Read about “local attention” in Effective Approaches to Attention-based Neural Machine Translation.

which also confused me.
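My best guess at the relative-position workaround is something like Luong et al.'s local attention, where only a fixed-size window of encoder positions around a predicted centre $p_t$ is scored. This is purely my guess, and the function below is my own simplified sketch:

```python
import torch

def local_attention(scores, p_t, D, sigma=None):
    """Restrict attention to the window [p_t - D, p_t + D] and favour positions
    near the predicted centre p_t with a Gaussian (my reading of 'local-p')."""
    T_x = scores.size(0)
    sigma = D / 2.0 if sigma is None else sigma
    positions = torch.arange(T_x, dtype=scores.dtype)
    window = (positions >= p_t - D) & (positions <= p_t + D)
    alpha = torch.softmax(scores.masked_fill(~window, float('-inf')), dim=0)
    return alpha * torch.exp(-(positions - p_t) ** 2 / (2 * sigma ** 2))
```

The window size $2D + 1$ is fixed regardless of the sentence length, which I assume is what "works around the length limitation" means. Is that the right reading? Any clarification would be appreciated.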

