Transformer decoder training. Batch slice

jubick · July 3, 2021, 3:44pm

So for the transformer decoder we want to feed in the embedded tokens, but without the EOS token.
Many tutorials do this with x[:,:-1] or x[:-1,:] (for BxL or LxB layouts, where B is the batch size and L is the sequence length).
It makes sense for one sentence, “BOS Bob is cool EOS” → “BOS Bob is cool”, as we don’t want anything to be predicted after the last token we pass in.
But in a batch this isn’t always true.
I don’t understand how to do this slice in a batch where not all of the sequences have the same length.
For example, “BOS Bob is cool EOS” → “BOS Bob is cool”, but “BOS Need help EOS PAD” → “BOS Need help EOS” if we simply slice off the last element, so the shorter sequence still has its EOS fed to the decoder.
Of course I’ve set ignore_index in the loss, but I’m not sure if that’s all we need.
Any help appreciated!
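
To make the setup concrete, here is a minimal sketch of the slice and the loss, assuming PyTorch and a BxL layout. The token ids (PAD=0, BOS=1, EOS=2), the vocabulary size and the random logits standing in for the decoder output are all made up for illustration. It checks one thing only: whether the position where the input is EOS and the target is PAD has any effect on the loss when ignore_index is set to the PAD id.

```python
import torch
import torch.nn.functional as F

# Hypothetical token ids and vocabulary size, chosen only for this sketch.
PAD, BOS, EOS = 0, 1, 2
VOCAB_SIZE = 10

# Two padded sequences, shape (B, L):
# "BOS Bob is cool EOS" and "BOS Need help EOS PAD"
batch = torch.tensor([
    [BOS, 3, 4, 5, EOS],
    [BOS, 6, 7, EOS, PAD],
])

decoder_input = batch[:, :-1]  # second row still ends with EOS
targets = batch[:, 1:]         # second row ends with PAD

# Stand-in for the decoder output: random logits of shape (B, L-1, V).
torch.manual_seed(0)
logits = torch.randn(batch.size(0), decoder_input.size(1), VOCAB_SIZE)


def loss_fn(logits, targets):
    # Flatten to (B*(L-1), V) and (B*(L-1),); PAD targets are skipped.
    return F.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE),
        targets.reshape(-1),
        ignore_index=PAD,
    )


loss = loss_fn(logits, targets)

# Overwrite only the logits at the EOS-input / PAD-target position.
perturbed = logits.clone()
perturbed[1, -1] = torch.randn(VOCAB_SIZE) * 100
loss_perturbed = loss_fn(perturbed, targets)

print(decoder_input[1].tolist(), targets[1].tolist())
print(torch.allclose(loss, loss_perturbed))
```

In this sketch the two losses match, because the position whose target is PAD is excluded from the loss, so whatever the decoder predicts after the EOS input is never trained on. If the decoder also uses a causal mask, earlier positions cannot attend to that trailing EOS, so it does not change their predictions either.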

jimmykobe · January 30, 2022, 8:07am

May I ask whether you have solved this issue?