I am trying to use the Transformer for the first time as an encoder-decoder for a tagging task, and I am finding the input argument tgt a bit under-described. The docs give us:
tgt – the sequence to the decoder (required).
Is this intended to be the embedding of the previously decoded timestep, or a tensor of all previously decoded timesteps? Also, given that there is a tgt_mask input, is that simply for batching, so that one sample can stop decoding while the others continue? Or does it imply that at train time you should pass the entire gold target sequence, with positions that have not yet been decoded masked out? I cannot think of another use case for masking the attention between tgt timesteps. Even then, at test time, what would you provide the model, since there is no gold sequence?
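To make the question concrete, here is a minimal sketch of my current understanding: teacher forcing with a causal tgt_mask at train time, and autoregressive decoding at test time. All dimensions, the vocabulary size, and the BOS handling are hypothetical, and I am feeding decoder outputs straight back in rather than projecting to the vocabulary, just to keep the sketch short.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

# --- Train time (my reading of the docs) ---
# tgt is the whole gold sequence, shifted right (here crudely: drop the
# last token; in practice you would prepend a BOS token instead).
src = embed(torch.randint(0, vocab_size, (8, 10)))  # (batch, src_len, d_model)
gold = torch.randint(0, vocab_size, (8, 7))         # gold target token ids
tgt = embed(gold[:, :-1])                           # (batch, 6, d_model)

# Causal mask so position i cannot attend to positions > i.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))
out = model(src, tgt, tgt_mask=tgt_mask)            # (batch, 6, d_model)

# --- Test time (my guess) ---
# No gold sequence, so decode autoregressively: start from a BOS
# embedding and feed everything decoded so far back in as tgt.
bos = torch.zeros(1, 1, d_model)                    # hypothetical BOS embedding
dec_in = bos
for _ in range(5):
    mask = model.generate_square_subsequent_mask(dec_in.size(1))
    step = model(src[:1], dec_in, tgt_mask=mask)    # (1, cur_len, d_model)
    # In practice: project step[:, -1] to the vocab, argmax, re-embed.
    dec_in = torch.cat([dec_in, step[:, -1:, :]], dim=1)
```

Is that roughly the intended usage, or am I misreading what tgt and tgt_mask are for?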
Could someone point me to any examples that clarify the usage of tgt?