Transformer to reorder sequences

My goal is to reorder input sequences like [1,1,1,2,2,3,4,5,6,6] into outputs like [0,0,6,6,0,0,6,2,2,1,5,0,0,1,12,3,4] to simulate an industrial process.
Every token from the input should also appear in the output, plus some 0s at the time steps where the process has no output.
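To make the constraint concrete, this is the check a valid output has to pass (a small illustrative helper, `is_valid_reordering` is just a name I made up for this post): the non-zero output tokens must be exactly the input tokens as a multiset.

```python
from collections import Counter

def is_valid_reordering(input_seq, output_seq):
    """True if the non-zero output tokens are exactly the input tokens (as a multiset)."""
    return Counter(t for t in output_seq if t != 0) == Counter(input_seq)

print(is_valid_reordering([1, 1, 2], [0, 1, 0, 2, 1]))   # True
print(is_valid_reordering([1, 1, 2], [0, 1, 2]))         # False, one 1 is missing
```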

I implemented it similarly to that Transformer and it learns pretty well.

But especially with long sequences, tokens are sometimes missing from the output, or the output contains tokens that do not occur in the input. This is really bad for my application.

Is there any way to restrict the output so that each token from the input sequence is generated exactly once?
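To make clear what I mean by "restrict the output": at inference time I could imagine masking the logits so that only tokens still remaining in the input (plus the 0 token) can be emitted. A rough sketch of that idea, assuming my model takes the source and the partial target and returns per-position logits in a sequence-first layout; all names, shapes and the `vocab_size` attribute are made up for illustration:

```python
import torch

def constrained_greedy_decode(model, src, src_tokens, max_len, zero_id=0):
    """Greedy decoding where every non-zero input token may be emitted at most as
    often as it occurs in the input. All shapes/attributes here are assumptions."""
    vocab_size = model.vocab_size                      # assumed attribute of the model
    remaining = torch.zeros(vocab_size)
    for t in src_tokens:
        remaining[t] += 1                              # how often each token may still appear

    out = []
    tgt = [zero_id]                                    # assumed start symbol
    for _ in range(max_len):
        tgt_tensor = torch.tensor(tgt).unsqueeze(1)    # (tgt_len, 1), sequence-first
        logits = model(src, tgt_tensor)[-1, 0]         # assumed (tgt_len, 1, vocab) output
        allowed = remaining > 0
        allowed[zero_id] = True                        # the "no output" token is always allowed
        logits = logits.masked_fill(~allowed, float("-inf"))
        next_tok = int(logits.argmax())
        if next_tok != zero_id:
            remaining[next_tok] -= 1
        out.append(next_tok)
        tgt.append(next_tok)
        if remaining.sum() == 0:                       # every input token has been placed
            break
    return out
```

This would only fix decoding, though; during training the model still gets no gradient signal from the constraint, which is why I was also thinking about the points below.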

What I was thinking about:

  • Using beam search for decoding and cutting out all paths that are not valid (is this possible during training? My model outputs logits, which I currently pass to CrossEntropyLoss).
  • Using different losses, e.g. JARO_WINKLER_LOSS or LEVENSHTEIN_LOSS (edit distance), or just something that counts the number of matching and differing tokens; but with all of these comparison methods the problem is that they are not differentiable (see the rough sketch after this list).
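
For the second point, the closest differentiable thing I could come up with is not an edit distance but a soft token-count penalty: sum the softmax probabilities over the time dimension to get an "expected count" per token and compare that to the true counts. This is only a rough sketch under assumed shapes, and `count_penalty` is my own name, not an existing loss:

```python
import torch
import torch.nn.functional as F

def count_penalty(logits, target, zero_id=0):
    """Soft, differentiable penalty on how often each token is produced.
    logits: (seq_len, vocab_size) raw outputs for one sequence,
    target: (seq_len,) gold sequence (input tokens interleaved with 0s)."""
    vocab_size = logits.size(-1)
    probs = F.softmax(logits, dim=-1)                  # (seq_len, vocab_size)
    soft_counts = probs.sum(dim=0)                     # expected count of each token
    true_counts = torch.bincount(target, minlength=vocab_size).float()
    keep = torch.ones(vocab_size)
    keep[zero_id] = 0.0                                # don't penalise the "no output" token
    return torch.abs((soft_counts - true_counts) * keep).sum()

# used as an extra term next to the usual CrossEntropyLoss, e.g.
# loss = ce_loss + 0.1 * count_penalty(logits, target)
```

This would only push the model towards producing the right number of each token, not the right order, so I am not sure it is enough on its own.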

I would be very grateful for any ideas, tips or links on how I could solve this.