Is it true that Bahdanau's attention mechanism is not Global like Luong's according to Pytorch's tutorial?

I was reading the pytorch tutorial on a chatbot task and attention where it said:

Luong et al. improved upon Bahdanau et al.’s groundwork by creating “Global attention”. The key difference is that with “Global attention”, we consider all of the encoder’s hidden states, as opposed to Bahdanau et al.’s “Local attention”, which only considers the encoder’s hidden state from the current time step.

I think that description is plain wrong (or at least confusing). As far as I understand, attention in general is the idea that we use a neural network that depends on the source (or encoder state) and the current target (or decoder state) to compute a weight determining the importance of each encoder/source state for the current target/decoder output. Then we take a weighted sum over all source hidden states to get the context vector $c_t = \sum^{T_x}_{s=1} \alpha_{s,t} \bar h_s$, where $\alpha_{s,t}$ is the attention weight for source/encoder step $s$ and $\bar h_s$ is the hidden state from the encoder/source at step $s$.
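To make the claim concrete, here is a minimal numpy sketch of that weighted sum (all names like `encoder_states` are my own, and I use a plain dot-product score just as one concrete choice of `score()`):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the scores.
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dimensions: T_x = 4 source steps, hidden size 3.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 3))   # \bar h_s for s = 1..T_x
decoder_state = rng.normal(size=(3,))      # h_t

# One concrete score(): a dot product between h_t and each \bar h_s.
scores = encoder_states @ decoder_state    # shape (T_x,)
alpha = softmax(scores)                    # \alpha_{s,t}, sums to 1

# Context vector: weighted sum over ALL encoder states.
c_t = alpha @ encoder_states               # shape (3,)
```

Every row of `encoder_states` enters both the softmax denominator and the final sum, which is the sense in which I'd call this "global".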

What is confusing is that the PyTorch tutorial claims that Bahdanau’s work is NOT global. I don’t understand why they say that about Bahdanau’s attention mechanism, since to me the following is true:

it uses all encoder/source states to compute the context vector via $c_t = \sum^{T_x}_{s=1} \alpha_{s,t} \bar h_s$, especially because $\alpha_{s,t}$ is a function of each source/encoder state. So of course it uses all encoder/source states.

Is there something that I am missing? What is the tutorial referring to?


Perhaps if I go through the equations here carefully I can outline why I think what I do:


Attention is computed as follows:

$$ \alpha_t(s) = \alpha_{s,t} = \mathrm{align}(h_t, \bar h_s) = \frac{\exp( \mathrm{score}(h_t, \bar h_s) ) }{\sum^{T_x}_{s'=1} \exp( \mathrm{score}(h_t, \bar h_{s'}) )}$$

and the context vector must be:

$$ c_t = \sum^{T_x}_{s=1} \alpha_{s,t} \bar h_s$$

because in the paper it says:

Given the alignment vector as weights, the context
vector $c_t$ is computed as the weighted average over
all the source hidden states.
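A small numpy sketch of those two equations also shows why I read this as global: perturbing any single source state changes the context vector (the function name `context` and the dot-product score are my own illustrative choices, not from either paper):

```python
import numpy as np

def context(encoder_states, decoder_state):
    # align(): softmax of score() over EVERY source step s' (the
    # denominator above runs over all s'), then the weighted average.
    scores = encoder_states @ decoder_state
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    return alpha @ encoder_states

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 3))   # \bar h_s, with T_x = 5
h_t = rng.normal(size=(3,))

c = context(H, h_t)

# Perturb one source state: c_t changes, i.e. every encoder hidden
# state participates in the context vector, not just the current one.
H2 = H.copy()
H2[2] += 1.0
assert not np.allclose(c, context(H2, h_t))
```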


First I will unify their notation.

  • target/decoder hidden state $h_t = s_t$
  • encoder/source hidden state $\bar h_s = h_s$
  • score $\mathrm{score}(h_t, \bar h_s) = e_{t,s}$
  • alignment $\alpha_{s,t} = \alpha_t(s)$

but the key is that they both use the same equation to compute context vectors:

$$ c_t = \sum^{T_x}_{s=1} \alpha_{s,t} h_s = \sum^{T_x}_{s=1} \alpha_t(s) \bar h_s $$

Of course there are differences in how they compute hidden states and scores, but they are BOTH global attention mechanisms. Or am I missing something?