Wrong implementation of attention in PyTorch examples

Hi everyone,
I recently tried to implement the attention mechanism in PyTorch. I searched through lots of GitHub repos, the official PyTorch implementation, and detailed tutorials such as the one on FloydHub. However, it seems to me that all of them implement the attention mechanism incorrectly!

The problem I see is that in the papers (both Bahdanau's and Luong's), the score is calculated by comparing all of the encoder's hidden states with a single timestep in the decoder: the current or previous timestep, to be exact.

The score is then used so that the current timestep in the decoder can ultimately produce one output.
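To make the scoring step described above concrete, here is a minimal sketch of attention for a single decoder timestep. It assumes Luong-style dot scoring (Bahdanau's additive score would replace the dot product with a small feed-forward network), and the tensor names and sizes are made up for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
src_len, hidden_size = 5, 8

# Encoder hidden states for ONE source sentence: (src_len, hidden_size).
encoder_states = torch.randn(src_len, hidden_size)
# Decoder hidden state at ONE timestep: (hidden_size,).
decoder_state = torch.randn(hidden_size)

# Dot-product score: one scalar per encoder position.
scores = encoder_states @ decoder_state      # (src_len,)
# Normalize the scores into attention weights that sum to 1.
weights = F.softmax(scores, dim=0)           # (src_len,)
# Context vector: attention-weighted sum of the encoder states.
context = weights @ encoder_states           # (hidden_size,)
```

The point is only that the scores compare every encoder position against one decoder state, so the decoder has to be advanced one timestep at a time somewhere in the code.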

Now, if you look at the code, you can see that all of them use an LSTM/GRU layer instead of an LSTM/GRU cell, where each timestep is exposed!

They simply define the attention mechanism as a module, and in its forward() pass they feed the input to the LSTM/GRU layer and get its outputs! They then use those outputs as the hidden states, multiply (or otherwise combine) them with the encoder's states, and carry on with the rest of the formula.

The idea was that, before producing an output at each decoder timestep, the decoder's current/previous hidden state would be used. This is not the case here: the LSTM layer simply does its job and produces all outputs normally, and we are using these hidden states in each iteration (as opposed to in each sample).

The hidden states were supposed to belong to one sample (one German sentence/paragraph paired with one English sentence/paragraph). What happens here is that we are using hidden states from previous iterations that have nothing to do with the current sample! It is as if, when translating the sentence "hey buddy whats up?!", I looked at the hidden state for the previous sample sentence, "mama mia beat the hell out of his son Jose!".

Can someone please explain what's wrong here? Am I missing something?

What seems to happen (at least in the tutorial) is that the functions doing the decoding take only one time step (and for a GRU, output == new hidden), and the loop over the outputs is in the train function. One might argue that it'd be more PyTorchy to wrap this loop into a Module. Whether you use a cell or a layer that is capable of multiple steps but is fed a sequence of length 1 isn't crucial here.
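As a rough illustration of the point about sequence length 1, the sketch below steps an nn.GRU layer one timestep at a time from an outer loop and compares it with an nn.GRUCell that shares its weights. The sizes are hypothetical, and it assumes a single-layer, unidirectional GRU; with more layers or a bidirectional GRU, output and hidden would no longer coincide.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, steps = 4, 6, 3    # hypothetical sizes

gru = nn.GRU(input_size, hidden_size)       # layer: input is (seq_len, batch, input_size)
cell = nn.GRUCell(input_size, hidden_size)  # cell: input is (batch, input_size)

# Copy the layer's parameters into the cell so both compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

inputs = torch.randn(steps, 1, input_size)  # one sample, batch size 1
h_layer = torch.zeros(1, 1, hidden_size)    # (num_layers, batch, hidden_size)
h_cell = torch.zeros(1, hidden_size)        # (batch, hidden_size)

outputs_match_hidden = True
layer_matches_cell = True
with torch.no_grad():
    for t in range(steps):
        # Feeding a sequence of length 1 makes the layer perform exactly one timestep.
        out, h_layer = gru(inputs[t:t + 1], h_layer)
        h_cell = cell(inputs[t], h_cell)
        # Single layer, single step: the output is the new hidden state.
        outputs_match_hidden = outputs_match_hidden and torch.allclose(out, h_layer)
        layer_matches_cell = layer_matches_cell and torch.allclose(h_layer[0], h_cell, atol=1e-5)
```

In an attention decoder, the score computation would sit inside this loop between steps, which is why the choice between layer and cell does not change the result.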

Best regards


Thanks a lot, I really appreciate it. I completely forgot about the training loop; it didn't even occur to me that they might have done such a thing! Can you think of any reason why they did it this way when they could easily have implemented the for loop inside the module?
Is there any downside to doing that instead of following the official example?