Transformer "tgt" Question

Hi, I’m trying to run an experiment with a Transformer model in PyTorch.

The idea is to pass a sequence of tokens (each token is a 200-dim vector) into the transformer. The max token sequence length would be 500, and the batch size would be 20.

According to the docs, the shape of the input is:
[seq_length (500), batch_size (20), feature_dims (200)].
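
So the input tensor I’m constructing looks roughly like this (a minimal sketch, assuming the default seq-first layout rather than `batch_first=True`):

```python
import torch

# [seq_length, batch_size, feature_dims] as I read the docs
src = torch.randn(500, 20, 200)
print(src.shape)  # torch.Size([500, 20, 200])
```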

The task I’m attempting (part-of-speech classification via softmax) requires the decoder to output a 2-dimensional vector for each position, either [1, 0] or [0, 1]. Given this:

According to the docs, the shape of the output should be:
[seq_length (500), batch_size (20), feature_dims (2 in this case)]

But according to the docs, the decoder’s output feature length MUST match the input’s (200), which is fine, because I can add a fully connected layer to project that down to 2.
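
Something like this is what I have in mind for that projection (just a sketch; the tensor here is a stand-in for the real decoder output):

```python
import torch
import torch.nn as nn

head = nn.Linear(200, 2)                 # map the 200-dim decoder features to 2 classes
decoder_out = torch.randn(500, 20, 200)  # stand-in for the decoder's output
logits = head(decoder_out)               # [500, 20, 2]
```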

BUT the decoder also seems to require a “tgt” argument specifying what the decoder’s targets should be. So is there no way to train other than having the decoder aim for that target vector?
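
To be concrete about what I mean (a minimal sketch; `nhead=8` is just a placeholder hyperparameter):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=200, nhead=8)

src = torch.randn(500, 20, 200)
tgt = torch.randn(500, 20, 200)  # required second positional argument
out = model(src, tgt)            # works, shape [500, 20, 200]
# out = model(src)               # TypeError: tgt is a required argument
```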

Am I misunderstanding this? Why is the decoder so strict about what it must be given and how it must be used?

Is there any way around this?