I want to mask the output of the decoder in a seq2seq model; in my case I only need the last time step of the decoder. Since the sequences in the batch have different lengths, I use padding. I can’t use pack_padded_sequence because the lengths can’t be in decreasing order.
So I mask the output like this:
output_mask = create_mask(batch_size, lengths_dec, len(dec_outputs))
masked_outputs = torch.masked_select(dec_outputs, output_mask)
masked_outputs = masked_outputs.view((-1, self.hidden_size))
Finally, I get only the last time step of each sequence in the batch, which I pass to the linear layer.
Can I use this approach? Is there an easier way to do this?
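One possibly simpler way to pick each sequence's last valid time step is advanced indexing instead of a mask. This is a minimal sketch, assuming dec_outputs is time-major with shape [seq_len, batch_size, hidden_size] and the lengths are a LongTensor of true sequence lengths; the helper name last_valid_steps is hypothetical and not from the thread:

```python
import torch

def last_valid_steps(dec_outputs, lengths):
    # dec_outputs: [seq_len, batch_size, hidden_size] (assumed time-major)
    # lengths:     [batch_size], true length of each padded sequence
    batch_idx = torch.arange(dec_outputs.size(1), device=dec_outputs.device)
    # Pair time index (lengths - 1) with batch index b for every b.
    return dec_outputs[lengths - 1, batch_idx]

seq_len, batch_size, hidden_size = 5, 3, 4
dec_outputs = torch.randn(seq_len, batch_size, hidden_size, requires_grad=True)
lengths = torch.tensor([2, 5, 3])  # need not be sorted

last = last_valid_steps(dec_outputs, lengths)  # [batch_size, hidden_size]
```

The result keeps the batch order of the input, so it can go straight into the linear layer. In recent PyTorch versions, pack_padded_sequence also accepts enforce_sorted=False, which removes the need to sort by length at all.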
I don’t think your solution would work: the dimension you want to infer with view is not necessarily a multiple of hidden_size, since the size of masked_outputs is the number of nonzero elements in the mask. In fact, this size can be anything, depending on your data.
What you can do is apply the masks only at the loss computation step, and before that keep everything at the dimensions of your dec_outputs. Since in a seq2seq model your targets and outputs share the same mask, you can simply do:
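To make the shape concern concrete: masked_select always returns a flat 1-D tensor whose length is the number of True entries in the (broadcast) mask. The sketch below is only an illustration, since create_mask is not shown in the thread; both masks here are assumptions. A mask with one flag per time step, broadcast over the hidden dimension, keeps whole hidden vectors, while an element-wise mask can keep any number of values:

```python
import torch

seq_len, batch_size, hidden_size = 4, 2, 3
dec_outputs = torch.randn(seq_len, batch_size, hidden_size)

# Assumed mask layout: one flag per (time, batch) position, broadcast over hidden.
step_mask = torch.zeros(seq_len, batch_size, 1, dtype=torch.bool)
step_mask[1, 0] = True  # last valid step of sequence 0
step_mask[3, 1] = True  # last valid step of sequence 1

selected = torch.masked_select(dec_outputs, step_mask)  # 1-D, 2 * hidden_size values
rows = selected.view(-1, hidden_size)                   # safe: whole vectors were kept

# Element-wise mask: the number of kept values depends on the data.
elem_mask = dec_outputs > 0
n_kept = torch.masked_select(dec_outputs, elem_mask).numel()
```

With the first kind of mask the view to (-1, hidden_size) is well defined; with the second it only works by coincidence.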
masked_outputs = torch.masked_select(dec_outputs, mask)
masked_targets = torch.masked_select(targets, mask)
loss = my_criterion(masked_outputs, masked_targets)
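A self-contained sketch of the masked loss above, which also checks how gradients behave at padded positions. The shapes, the length-based mask, and the use of nn.MSELoss in place of my_criterion are all assumptions made for illustration:

```python
import torch
import torch.nn as nn

seq_len, batch_size = 4, 2
outputs = torch.randn(seq_len, batch_size, requires_grad=True)  # stand-in for dec_outputs
targets = torch.randn(seq_len, batch_size)
lengths = torch.tensor([4, 2])

# mask[t, b] is True while t < lengths[b], i.e. for non-padded positions.
mask = torch.arange(seq_len).unsqueeze(1) < lengths.unsqueeze(0)

criterion = nn.MSELoss()  # assumed criterion, not the one from the thread
loss = criterion(torch.masked_select(outputs, mask),
                 torch.masked_select(targets, mask))
loss.backward()

# Padded positions were never selected, so their gradient is exactly zero.
padded_grad = outputs.grad[~mask]
```

masked_select is differentiable with respect to its input, so the selected positions receive gradient and the padded ones do not contribute to the loss.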
@alexis-jacq Thank you for your response!
The shapes are all fine and it works. My question is about the gradient: does it flow correctly? It seems not, because the model converges with batch size = 1 but not with batch size > 1. Can someone help?
Also, my decoder is a many-to-one LSTM, so it is not a standard seq2seq model, and my labels have shape = [batch_size], so I don’t need to mask the labels.
@VladislavPrh, have you found a solution? I am also using the solution by @alexis-jacq, but for batch_size > 1 the model doesn’t converge.
The problem was with sorting: I forgot to sort labels by the encoder
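For anyone hitting the same issue, here is a minimal sketch of keeping labels aligned when a batch is sorted by length. It assumes the batch is reordered by descending length before packing; the tensor names and values are hypothetical:

```python
import torch

lengths = torch.tensor([3, 5, 2])     # true lengths, original batch order
inputs = torch.zeros(3, 5)            # hypothetical padded batch [batch_size, seq_len]
labels = torch.tensor([10, 11, 12])   # one label per sequence, shape [batch_size]

# Sort by length and apply the SAME permutation to everything batch-indexed.
sorted_lengths, perm = lengths.sort(descending=True)
sorted_inputs = inputs[perm]
sorted_labels = labels[perm]

# Inverse permutation, if outputs need to go back to the original order.
inv_perm = perm.argsort()
restored_labels = sorted_labels[inv_perm]
```

If the labels are left in the original order, each output is scored against another sequence's label. With batch size 1 the permutation is trivial, which would be consistent with the model converging only in that case.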