Machine translation with just a transformer decoder

I’m working on an unusual machine translation problem: mapping a robotic rendition of a piece of music (imagine every note played at its exact notated duration and volume) to an actual human performance of that piece. I was hoping to use a transformer encoder-decoder architecture, but tokenization yields sequences of 15,000 tokens on average, which is too computationally expensive to train on with my budget.

However, in all of the tokenized examples, the input robotic piece is almost the same length as the output performance. Could I just use a transformer decoder to map between the two, and pad with special tokens so the lengths match? For example (a code sketch follows this list):

  • if len(input) < len(output), pad the input to the output length with special tokens
  • if len(input) > len(output), pad the output to the input length with special tokens
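
Here’s a minimal sketch of the padding step I have in mind (PAD_ID and pad_to_match are placeholder names I made up, not from any particular tokenizer or library):

    # Minimal padding sketch. PAD_ID is a placeholder I'd reserve in the
    # vocabulary; the token sequences are plain Python int lists for illustration.
    PAD_ID = 0

    def pad_to_match(input_ids, output_ids, pad_id=PAD_ID):
        """Pad the shorter sequence so both token lists end up the same length."""
        target_len = max(len(input_ids), len(output_ids))
        input_ids = input_ids + [pad_id] * (target_len - len(input_ids))
        output_ids = output_ids + [pad_id] * (target_len - len(output_ids))
        return input_ids, output_ids

    # e.g. the input is shorter than the output, so it gets padded to length 5
    src, tgt = pad_to_match([5, 7, 9], [5, 7, 9, 11, 13])
    assert len(src) == len(tgt) == 5

(I’d also mask the PAD_ID positions out of the training loss so the model isn’t penalized on padding.)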