I’m implementing this paper in PyTorch, working from the original Caffe source code.
The authors talk about improving the attention mechanism in LSTMs, but the details are a bit obscure. See section 2.2.2 of the paper for details.
My understanding, though, is that the authors have used the same method for attention weights as the one defined in this tutorial for PyTorch.
That is, the attention weights are calculated by a linear layer that takes the encoder output as input, followed by a concat layer that applies the attention. The attention alignment is done through the loss layer, rather than through any changes to the attention weights, attention vector, or computed context vector.
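To make my reading concrete, here is a minimal sketch of the kind of layer I have in mind. It is only my interpretation, not the authors' code: the class name `LinearAttentionStep`, the layer sizes, the ReLU after the concat layer, and the choice to score each encoder output together with the decoder hidden state are all assumptions of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttentionStep(nn.Module):
    """Hypothetical one-step attention decoder cell (all names are illustrative)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Linear layer that scores each encoder output given the decoder hidden state.
        self.attn = nn.Linear(hidden_size * 2, 1)
        # "Concat layer": merges the decoder input with the context vector.
        self.attn_combine = nn.Linear(input_size + hidden_size, hidden_size)
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)

    def forward(self, x, state, encoder_outputs):
        # x: (batch, input_size)
        # state: tuple (h, c), each (batch, hidden_size)
        # encoder_outputs: (batch, seq_len, hidden_size)
        h, c = state
        seq_len = encoder_outputs.size(1)
        # Repeat the hidden state along the time axis so it can be paired
        # with every encoder output.
        h_rep = h.unsqueeze(1).expand(-1, seq_len, -1)
        scores = self.attn(torch.cat((encoder_outputs, h_rep), dim=2)).squeeze(2)
        attn_weights = F.softmax(scores, dim=1)  # (batch, seq_len)
        # Context vector: attention-weighted sum of the encoder outputs.
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        # Weighted input for the decoder LSTM.
        lstm_in = F.relu(self.attn_combine(torch.cat((x, context), dim=1)))
        h, c = self.lstm(lstm_in, (h, c))
        return h, c, attn_weights


# Example call with made-up sizes.
torch.manual_seed(0)
cell = LinearAttentionStep(input_size=8, hidden_size=16)
enc = torch.randn(2, 5, 16)   # batch of 2, sequence length 5
x = torch.randn(2, 8)
state = (torch.zeros(2, 16), torch.zeros(2, 16))
h, c, w = cell(x, state, enc)
```

The attention weights are returned alongside the new state so that a loss term could use them, which is where I assume the alignment would come in.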
My question is: in the first link provided, is the authors’ attention_LSTM layer just a simple linear layer that calculates attention weights and then feeds a weighted input to a decoder LSTM, as in the PyTorch tutorial example? Or is something else being done there too?