Does anyone have sample code for batched multiple packed sequences with Attention?
Here's where I am confused:
Background: I have a dataset with a series of sentence tuples (s1, s2). I have a bidirectional LSTM for each of them, and I would like to attend over the two sequences s1 and s2 before I send them off to a linear layer. I would also like to do this in a batched setting, though the pseudocode below is written as a per-instance (i.e. per (s1, s2) tuple) forward pass.
Something like this (not working code, just pseudocode):
combined_input_2_attention = torch.cat((s1_lstm_out, s2_lstm_out), 0)   # (len_s1 + len_s2, hidden)
attn_applied = torch.bmm(attention_alphas.unsqueeze(0),                 # alphas: (1, len_s1 + len_s2)
                         combined_input_2_attention.unsqueeze(0))       # -> (1, 1, hidden)
where s1_lstm_out and s2_lstm_out are the outputs of running one (s1, s2) tuple through the forward pass.
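For concreteness, here is a per-instance version of that pseudocode that I think runs (the hidden size, the Linear scorer, and all the names are just my placeholders -- not sure this is the right formulation):

import torch
import torch.nn.functional as F

H = 2 * 50                                    # BiLSTM output size (2 * hidden_dim), placeholder
s1_lstm_out = torch.randn(7, H)               # (len_s1, H) for one instance
s2_lstm_out = torch.randn(5, H)               # (len_s2, H) for one instance

combined = torch.cat((s1_lstm_out, s2_lstm_out), 0)          # (len_s1 + len_s2, H)

attn_layer = torch.nn.Linear(H, 1)                           # scores one timestep at a time
scores = attn_layer(combined).squeeze(-1)                    # (len_s1 + len_s2,)
attention_alphas = F.softmax(scores, dim=0).unsqueeze(0)     # (1, len_s1 + len_s2)

attn_applied = torch.bmm(attention_alphas.unsqueeze(0),      # (1, 1, T)
                         combined.unsqueeze(0))              # (1, T, H) -> (1, 1, H)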
Q1. If this were batched -- do the attention weights (alphas) need to have the dimension of the max sequence length (len(s1) + len(s2), since I am concatenating) per batch(?), or globally? I doubt this can be per batch -- because how would I initialize the dimensions of the linear layer that computes the attention?
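My current guess (which may well be wrong -- this is part of what I'm asking) is that the scoring layer can be applied per timestep, so its dimensions depend only on the hidden size and not on any max length, with padding handled by a mask before the softmax. Something like this, where all the names and sizes are placeholders:

import torch
import torch.nn.functional as F

B, T, H = 4, 12, 100                           # batch, padded length (per batch), hidden; placeholders
outputs = torch.randn(B, T, H)                 # padded, batch-first LSTM outputs
lengths = torch.tensor([12, 9, 7, 3])          # true lengths per example

attn_layer = torch.nn.Linear(H, 1)             # dimensions depend only on H, not on T
scores = attn_layer(outputs).squeeze(-1)       # (B, T)

# mask out padding positions so they get zero attention
mask = torch.arange(T).unsqueeze(0) < lengths.unsqueeze(1)   # (B, T) bool
scores = scores.masked_fill(~mask, float('-inf'))
alphas = F.softmax(scores, dim=1)              # (B, T)

context = torch.bmm(alphas.unsqueeze(1), outputs).squeeze(1) # (B, H)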
Q2. Either way, I need to pack the sequences for the two LSTMs before I attend over them or concatenate them -- but the problem is that packing requires each of them to be sorted by length individually.
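For reference, this is the kind of packing call I mean (placeholder sizes; I believe newer PyTorch releases also accept enforce_sorted=False in pack_padded_sequence, which would sidestep the manual sort, but the version I'm on requires sorted input):

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = torch.nn.LSTM(input_size=100, hidden_size=50,
                     bidirectional=True, batch_first=True)

padded = torch.randn(4, 12, 100)              # (batch, max_len, emb); placeholder
lengths = torch.tensor([12, 9, 7, 3])         # must be sorted descending here

packed = pack_padded_sequence(padded, lengths, batch_first=True)
packed_out, _ = lstm(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)   # (4, 12, 100)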
I read the answer about using sort, but I could not follow it completely.
Here's what I understood from the post above:
1) I take a set of tuples (s1, s2), sort the s1 batch and the s2 batch individually by length using tensor.sort, and keep track of their original indices.
2) Individually pack s1 and s2 into padded sequences based on their individual max lengths per batch and send them through the forward pass.
3) At the end, I return my outputs by reverse-mapping them back to the original order using the indices from (1)? If yes, wouldn't the network have seen the s1 and s2 instances in different orders from my original pairing -- why is this correct / why does this work? What am I missing? (My attempt at sketching this in code follows the list.)
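To make (1)-(3) concrete, here is my attempt at the sort/pack/unsort dance for one side (argsort of the sort indices to undo the permutation is what I took from that answer; sizes are placeholders, and in my real model there would be a separate LSTM per side):

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

def run_lstm_sorted(lstm, padded, lengths):
    # sort by length, pack, run the LSTM, then restore the original order
    sorted_lens, sort_idx = lengths.sort(descending=True)
    packed = pack_padded_sequence(padded[sort_idx], sorted_lens, batch_first=True)
    packed_out, _ = lstm(packed)
    out, _ = pad_packed_sequence(packed_out, batch_first=True)
    unsort_idx = sort_idx.argsort()        # inverse permutation
    return out[unsort_idx]                 # rows line up with the input again

lstm = torch.nn.LSTM(100, 50, bidirectional=True, batch_first=True)
s1 = torch.randn(4, 12, 100); s1_lens = torch.tensor([7, 12, 3, 9])
s2 = torch.randn(4, 10, 100); s2_lens = torch.tensor([10, 2, 8, 5])

# each side is sorted only *inside* this call; after unsorting,
# row i of s1_out and row i of s2_out still belong to the same (s1, s2) pair
s1_out = run_lstm_sorted(lstm, s1, s1_lens)
s2_out = run_lstm_sorted(lstm, s2, s2_lens)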
Any leads/clarification would be extremely helpful! Thanks!