ex: 10 X 253 X 768 … This tensor X is from the output of an LSTM.

There is another tensor Y, of shape:

batch_size X number_of_sequence X embedding_size

ex: 10 X 253 X 300 … This tensor Y is the output from an Embedding.

I need to work with these two tensors X and Y and feed them into an attention network to match the sequence Y to each element of X. I need a bit of help deciding which operation would be better to pack X and Y together … I mean, would torch.cat((X, Y), dim=2) be a good idea?
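For reference, concatenating along the feature dimension just stacks the two feature vectors side by side, so the last dimension becomes 768 + 300. A quick sketch with random tensors of the shapes from the post:

```python
import torch

# Shapes from the example above: X from an LSTM, Y from an embedding.
X = torch.randn(10, 253, 768)  # batch_size x sequence_length x hidden_size
Y = torch.randn(10, 253, 300)  # batch_size x sequence_length x embedding_size

# Concatenation along dim=2 stacks the feature dimensions: 768 + 300 = 1068.
XY = torch.cat((X, Y), dim=2)  # shape (10, 253, 1068)
```

Note this only works because X and Y already agree on the batch and sequence dimensions; only the feature dimension may differ.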

The dimensions of these two tensors X and Y are different, and in my case I want the dimension of X. So what I can do is pass Y through an LSTM to get it into the same dimension as X.

Then I can do a point-wise addition between X and Y to get a unified, weighted tensor.
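Concretely, that plan might look like the following sketch (the sizes are taken from the example above; the LSTM here is illustrative, not the one from the original model):

```python
import torch
import torch.nn as nn

X = torch.randn(10, 253, 768)  # LSTM output: batch x seq_len x hidden_size
Y = torch.randn(10, 253, 300)  # embedding output: batch x seq_len x embedding_size

# Hypothetical LSTM that lifts Y from embedding_size (300) to hidden_size (768).
proj_lstm = nn.LSTM(input_size=300, hidden_size=768, batch_first=True)
Y_proj, _ = proj_lstm(Y)       # shape (10, 253, 768)

# Point-wise (element-wise) addition, now that the shapes match.
fused = X + Y_proj             # shape (10, 253, 768)
```

A cheaper alternative for the dimension change would be a single nn.Linear(300, 768); the LSTM additionally mixes information across time steps.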

Can you share whether point-wise addition would be a good operation in this case, or whether there are other, better operations?

I have different types of attention mechanisms (one-directional, bi-directional, etc.),

similar to

Mostly, they create an affinity matrix of shape M x N (M = sequence length of X, N = sequence length of Y) via a dot (vector) product between X and Y, and use this affinity matrix to summarise X and/or Y.

I am working on a question-answering task. Instead of a CNN, the input comes from a word embedding.

So, I have 2 tensors.

An output from an LSTM (in the attached figure, this is the output from the LSTM in 'First attention' to x2_1) that holds the representations of:

1.1 the input words (X_1) and other features.
1.2 the question embedding on the input words.

Let’s say this tensor is hidden_docs. The shape of this tensor is batch_size X sequence_length X hidden_size. For example, hidden_docs = tensor(10, 253, 768)

X_1 is the input words representation.

Let’s say this tensor is x1_emb. The shape of this tensor is batch_size X sequence_length X embedding_size. For example, x1_emb = tensor(10, 253, 300)

I need to project x1_emb onto hidden_docs to get a unified representation of hidden_docs and x1_emb. This representation then needs to be fed into x2_1 (in the ‘Attention again’ step).

Currently, since the shapes are different, I am passing x1_emb through an LSTM so that its shape becomes batch_size X sequence_length X hidden_size.

Then I do a point-wise addition between hidden_docs and the LSTM output of x1_emb before passing the result to x2_1. This is where I am a bit confused.

Is point-wise addition between these two tensors the right operation? Are there other, better operations for capturing a richer joint representation of hidden_docs and x1_emb?
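For comparison, one common alternative to addition is concatenation followed by a linear projection back down to hidden_size. A minimal sketch (the fusion layer is hypothetical; both inputs are assumed to already be at hidden_size, i.e. x1_emb after the LSTM step above):

```python
import torch
import torch.nn as nn

hidden_docs = torch.randn(10, 253, 768)  # LSTM output over the document
x1_hidden = torch.randn(10, 253, 768)    # x1_emb after the projection LSTM

# Hypothetical fusion: concatenate features, then project 1536 -> 768.
fuse = nn.Linear(768 * 2, 768)
fused = fuse(torch.cat((hidden_docs, x1_hidden), dim=2))  # (10, 253, 768)
```

Unlike plain addition, the linear layer can learn how much weight to give each source per feature, at the cost of extra parameters.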

I understand. I’m not sure one could say which operations are right or wrong. In general, I have seen papers use both addition and dot-product-based attention mechanisms. I guess you can only evaluate it empirically.