Is it possible to vectorize transformer inference when generating a new sequence?

My feeling is that this is not possible. During training I believe it is possible because we feed the transformer the true tokens (essentially teacher forcing), so every target position can be predicted in parallel with a single forward pass. During inference, however, we don't have the ground truth, so we have to be genuinely auto-regressive and generate one token at a time. Is this right?
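For illustration, here is a minimal, self-contained sketch of the two cases (a hypothetical toy model, not the linked repo's code): teacher-forced training scores all target positions in one decoder call, while greedy decoding has to loop because each step's input is the previous step's prediction.

    # Toy sketch, all names (decode, VOCAB, BOS, EOS, ...) are hypothetical stand-ins.
    import torch
    import torch.nn as nn

    VOCAB, D_MODEL, MAX_LEN = 100, 32, 20
    BOS, EOS = 1, 2

    embed = nn.Embedding(VOCAB, D_MODEL)
    layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=4,
                                       dropout=0.0, batch_first=True)
    decoder = nn.TransformerDecoder(layer, num_layers=2)
    proj = nn.Linear(D_MODEL, VOCAB)

    def decode(trg_tokens, memory):
        """One decoder pass over every target position currently available."""
        T = trg_tokens.size(1)
        # Causal mask: position t may only attend to positions <= t.
        causal_mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        hidden = decoder(embed(trg_tokens), memory, tgt_mask=causal_mask)
        return proj(hidden)                       # (batch, T, VOCAB)

    memory = torch.randn(1, 5, D_MODEL)           # stand-in encoder output

    # Training (teacher forcing): ONE call predicts every position in parallel,
    # because the true previous tokens are already known.
    gold = torch.randint(0, VOCAB, (1, 10))       # ground-truth target tokens
    train_logits = decode(gold, memory)           # (1, 10, VOCAB), all at once

    # Inference (greedy decoding): each step's input is the previous step's
    # prediction, so the loop over time steps cannot be vectorized.
    trg = [BOS]
    for _ in range(MAX_LEN):
        with torch.no_grad():
            logits = decode(torch.tensor([trg]), memory)
        next_token = logits[:, -1].argmax(-1).item()   # only the last position is used
        trg.append(next_token)
        if next_token == EOS:
            break

(The greedy loop can still be batched over several sentences at once; it is only the time dimension that has to stay sequential.)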

Note: this is a code example showing what I believe is the common transformer test-time decoding loop:

NeuralMachineTranslation/translator.py at 7fd9450c88d833748218c1678124cc67e3303065 · azadyasar/NeuralMachineTranslation · GitHub

    for i in range(max_len):
      # Re-encode everything generated so far as the decoder input.
      trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(self.config.device)
      trg_mask = self.model.make_trg_mask(trg_tensor)
      # Run the full decoder over the whole prefix at every step.
      with torch.no_grad():
        output, attention = self.model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
      # Greedily take the most likely token at the last position only.
      pred_token = output.argmax(2)[:, -1].item()
      trg_indexes.append(pred_token)

      if pred_token == self.config.trg_vocab.eos_idx:
        break

Another example and discussion: How to vectorize decoder translation in transformer? · Issue #1 · azadyasar/NeuralMachineTranslation · GitHub