With a Transformer, can teacher forcing be done in parallel?

I have a sequence-to-sequence model built on nn.Transformer, and my training loop looks like this:

        for i in range(x.size(1)):
            pred = model(x, tgt[:, 0:i + 1, :])
            loss = loss_fn(pred, ground_truth[:, 0:i + 1, :])
            total_loss += loss.item()
            cnt += 1
        return total_loss / cnt

Here x.size() is [batch, seq, features]. The target prefix passed to the decoder grows by one step in each iteration, so a single batch needs as many forward passes as there are sequence positions, which takes a LONG time. Can this be done in parallel somehow?

Hi, did you manage to find a way to parallelize the loop? I am also looking for a way to do that. Thanks!
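
One approach that is commonly used for this is to pass the whole target sequence to the decoder once, together with a causal (subsequent-position) mask as tgt_mask, so that position i can only attend to positions 0..i. The sketch below is a minimal illustration, not the original poster's code: the tensor sizes, the tiny model configuration, the random data, and the use of nn.MSELoss are all assumptions made for the example, and it assumes tgt is already the right-shifted decoder input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, src_len, tgt_len, d_model = 2, 5, 6, 16

# Small model; dropout is set to 0 so repeated forward passes are comparable.
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=1, num_decoder_layers=1,
                       dim_feedforward=32, dropout=0.0, batch_first=True)
loss_fn = nn.MSELoss()  # assumed loss; the thread does not say which one is used

x = torch.randn(batch, src_len, d_model)             # encoder input
tgt = torch.randn(batch, tgt_len, d_model)           # decoder input (assumed shifted right)
ground_truth = torch.randn(batch, tgt_len, d_model)  # expected decoder output

# Causal mask: -inf above the diagonal blocks attention to later positions,
# 0 on and below the diagonal leaves earlier positions visible.
causal_mask = torch.triu(
    torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

# One forward pass produces predictions for every target position at once.
pred = model(x, tgt, tgt_mask=causal_mask)
loss = loss_fn(pred, ground_truth)
loss.backward()
```

With the mask in place, the output at position i does not depend on later target positions, so the single pass plays the role of the per-prefix loop. Two caveats: the loop in the question passes no mask, so its intermediate positions can see the whole prefix and the numbers will not match exactly, and averaging the loss over growing prefixes weights early positions more heavily than a single mean over all positions does.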