My feeling is that this is not possible. During training I believe it is possible because we feed the transformer the true tokens (basically teacher forcing). However, at test time we don't have the ground truth, so we have to be truly auto-regressive and generate one token at a time. Is this right?
Here is a code example showing what I believe is the common transformer test-time decoding loop:
for i in range(max_len):
    trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(self.config.device)
    trg_mask = self.model.make_trg_mask(trg_tensor)
    with torch.no_grad():
        output, attention = self.model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
    pred_token = output.argmax(2)[:, -1].item()
    trg_indexes.append(pred_token)
    if pred_token == self.config.trg_vocab.eos_idx:
        break
Another example and discussion: "How to vectorize decoder translation in transformer?", Issue #1 in azadyasar/NeuralMachineTranslation on GitHub.
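To make the question concrete, here is a minimal sketch of both cases. It assumes a toy decoder built from `torch.nn.TransformerDecoder`. The names `ToyDecoder` and `greedy_decode`, the sizes, and the `BOS`/`EOS` ids are all made up for illustration and are not the `self.model` from my snippet, and positional encodings are left out to keep it short. The first part checks that, when the target tokens are known, one masked forward pass gives the same logits as recomputing them prefix by prefix. The second part is a greedy loop that still runs one step per token, but handles a whole batch of sentences at each step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, D_MODEL, BOS, EOS = 20, 16, 1, 2  # toy sizes and token ids (assumptions)


class ToyDecoder(nn.Module):
    """Hypothetical stand-in for model.decoder: token ids in, logits out."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerDecoderLayer(
            D_MODEL, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, trg, memory):
        # Causal mask: position t may only attend to positions <= t.
        length = trg.size(1)
        mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(trg), memory, tgt_mask=mask)
        return self.out(hidden)  # shape: (batch, length, VOCAB)


model = ToyDecoder().eval()
memory = torch.randn(3, 5, D_MODEL)  # fake encoder output for 3 source sentences

# Known targets (teacher forcing): a single call scores every position.
true_trg = torch.randint(3, VOCAB, (3, 6))
with torch.no_grad():
    parallel_logits = model(true_trg, memory)
    # The same logits, recomputed one prefix at a time.
    stepwise_logits = torch.stack(
        [model(true_trg[:, : t + 1], memory)[:, -1] for t in range(6)], dim=1)
same = torch.allclose(parallel_logits, stepwise_logits, atol=1e-4)


def greedy_decode(model, memory, max_len=10):
    """Sequential over time steps, vectorized over the batch dimension."""
    batch = memory.size(0)
    trg = torch.full((batch, 1), BOS, dtype=torch.long)
    finished = torch.zeros(batch, dtype=torch.bool)
    for _ in range(max_len):
        with torch.no_grad():
            logits = model(trg, memory)
        next_token = logits[:, -1].argmax(dim=-1)
        # Rows that already emitted EOS keep emitting EOS as padding.
        next_token = next_token.masked_fill(finished, EOS)
        trg = torch.cat([trg, next_token.unsqueeze(1)], dim=1)
        finished |= next_token == EOS
        if finished.all():
            break
    return trg


decoded = greedy_decode(model, memory)
```

If this sketch is right, the loop over time steps cannot be removed at test time, because each step's input is the previous step's prediction. What can be vectorized is the batch dimension. This version also recomputes the whole prefix at every step, as my loop above does. Caching the per-layer states from earlier steps would avoid that, but it is not shown here.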