Transformer tgt argument


I am trying to use nn.Transformer for the first time as an encoder-decoder for a tagging task, and I am finding the input argument tgt a bit under-described. All the docs give us is: tgt – the sequence to the decoder (required).

Is this intended to be the embedding of the previously decoded timestep, or a tensor of all previously decoded timesteps? Also, given that there is a tgt_mask input: is that simply for batching, so that if one sample stops decoding the others can continue? Or does it imply that at train time it is preferable to pass the entire gold target sequence, with the positions that have not been decoded yet masked out? I cannot think of another use case for masking the attention between tgt timesteps. And even then, at test time, what would you provide the model, since there is no gold sequence?
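For reference, this is the kind of mask I am imagining for the "mask out future positions" interpretation. I built it by hand here so it is version-independent, but I believe nn.Transformer also ships a generate_square_subsequent_mask helper that produces the same thing:

```python
import torch

# Causal (subsequent) mask: entries set to -inf are blocked in attention,
# so timestep i can only attend to timesteps <= i.
sz = 4
mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)
print(mask)
```

If this is the intended use of tgt_mask, then the whole gold sequence goes in as tgt during training, and the mask prevents each position from peeking at later positions.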

Could someone point me to any examples that clarify the usage of tgt and tgt_mask?

In the basic case, my assumption is that at inference time (or when student forcing, i.e. feeding the model's own predictions back in) we want something like:

tgt = start_embedding  # (1, batch, d_model); grows by one step per iteration
for i in range(sequence_length):
    out = transformer(input_representation, tgt)

    # self.out projects from d_model into the output (tag) space;
    # softmax to compute cross-entropy, or take the max as the prediction
    preds = self.softmax(self.out(out[-1]))
    _, indices = torch.max(preds, dim=-1)

    # append the embedding of the newest prediction to tgt
    tgt = torch.cat([tgt, self.embed(indices).unsqueeze(0)], dim=0)
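And to make the train-time half of my question concrete, here is a minimal shape-level sketch of what I think teacher forcing with tgt and tgt_mask would look like. The out_proj head, the dimensions, and the random tensors are all made up for illustration; only the nn.Transformer call itself is from the docs (note the default layout is (seq_len, batch, d_model)):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab = 16, 10
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=1, num_decoder_layers=1,
                       dim_feedforward=32)
out_proj = nn.Linear(d_model, vocab)  # hypothetical output head

src = torch.randn(5, 2, d_model)     # encoder input: (src_len, batch, d_model)
tgt_in = torch.randn(6, 2, d_model)  # gold target embeddings, shifted right
tgt_mask = torch.triu(torch.full((6, 6), float('-inf')), diagonal=1)

dec = model(src, tgt_in, tgt_mask=tgt_mask)  # (6, 2, d_model)
logits = out_proj(dec)                       # (6, 2, vocab)
```

Is this the intended pattern, with the causal tgt_mask doing the "hide the future" work, and the step-by-step loop above only needed at test time?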