I’m trying to add an attention mechanism to an LSTM encoder-decoder. If I understand correctly, the idea is to compute a context vector at every decoder time step and use it, together with the previously predicted word, to predict the next word.
Now, an LSTM takes the previous hidden and cell states plus an input vector as its inputs. So I have to combine the vector of the last predicted word with the context vector before feeding them to the LSTM. Is that correct? If so, what is the standard way of doing that? Something like this?
non_linear_function(torch.mm(weight1, prev_output) + torch.mm(weight2, context_vector))
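To make the question concrete, here is a minimal sketch of the two options I can think of for a single decoder step. The class name `DecoderStep`, the `mode` flag, and all dimensions are made up for illustration, and the context vector is just a random stand-in rather than the output of a real attention layer. The `"concat"` mode simply concatenates the two vectors and lets the LSTM's own input weights mix them; the `"project"` mode is my formula above with `tanh` as the non-linearity.

```python
import torch
import torch.nn as nn


class DecoderStep(nn.Module):
    """One decoder time step (hypothetical helper, for illustration only)."""

    def __init__(self, embed_dim, context_dim, hidden_dim, mode="concat"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            # The LSTM input is the two vectors placed side by side.
            input_dim = embed_dim + context_dim
        else:
            # "project": tanh(W1 * prev_output + W2 * context_vector)
            self.w_prev = nn.Linear(embed_dim, hidden_dim, bias=False)
            self.w_ctx = nn.Linear(context_dim, hidden_dim)
            input_dim = hidden_dim
        self.cell = nn.LSTMCell(input_dim, hidden_dim)

    def forward(self, prev_output, context_vector, state):
        if self.mode == "concat":
            x = torch.cat([prev_output, context_vector], dim=1)
        else:
            x = torch.tanh(self.w_prev(prev_output) + self.w_ctx(context_vector))
        # state is the (hidden, cell) pair from the previous time step.
        h, c = self.cell(x, state)
        return h, c


# Assumed sizes, chosen arbitrarily for the example.
batch, embed_dim, context_dim, hidden_dim = 4, 8, 16, 32
prev_output = torch.randn(batch, embed_dim)         # embedding of the last predicted word
context_vector = torch.randn(batch, context_dim)    # stand-in for the attention output
state = (torch.zeros(batch, hidden_dim), torch.zeros(batch, hidden_dim))

h_cat, c_cat = DecoderStep(embed_dim, context_dim, hidden_dim, "concat")(
    prev_output, context_vector, state
)
h_proj, c_proj = DecoderStep(embed_dim, context_dim, hidden_dim, "project")(
    prev_output, context_vector, state
)
print(h_cat.shape, h_proj.shape)
```

As far as I can tell, the two are closely related: a linear layer applied to the concatenated vector is the same as the sum of two separate linear projections, so the `"project"` variant mainly adds an extra non-linearity before the LSTM.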