Training a generative model without teacher forcing

I understand this is more of a theoretical question, but would there be any reason to train an RNN or transformer architecture without teacher forcing, that is, without using the target to guide predictions during training for such generative models? Wouldn't doing so bypass the train-test discrepancy known as exposure bias (or covariate/distribution shift), where at test time the model's outputs are fed back in autoregressively as inputs, rather than the targets seen during training? Two obvious downsides to this approach are the serial processing requirement, which slows down training considerably, and the fact that the model's inputs change at every epoch, which may hurt its ability to converge.

Training a model without teacher forcing also creates an additional relationship between the model’s parameters and its ultimate prediction. Should we be backpropagating through these predictions made by the model? Assume in this case that the outputs of the sequence model are real-valued/continuous (MSE loss) and that the network is fully differentiable.
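For concreteness, here is a minimal PyTorch sketch of the setup I mean, assuming a toy single-step GRU regressor with MSE loss. The `detach_feedback` flag is hypothetical and only there to contrast backpropagating through the fed-back predictions with cutting the gradient at each step:

```python
import torch
import torch.nn as nn

class AutoregressiveGRU(nn.Module):
    def __init__(self, d_in=1, d_hidden=32):
        super().__init__()
        self.gru = nn.GRUCell(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_in)

    def forward(self, first_input, targets, detach_feedback=False):
        # targets: (batch, seq_len, d_in) real-valued sequences
        h = torch.zeros(targets.size(0), self.gru.hidden_size)
        x, preds = first_input, []
        for t in range(targets.size(1)):
            h = self.gru(x, h)
            y_hat = self.head(h)
            preds.append(y_hat)
            # No teacher forcing: the prediction, not targets[:, t], becomes the next input.
            x = y_hat.detach() if detach_feedback else y_hat
        return torch.stack(preds, dim=1)

model = AutoregressiveGRU()
targets = torch.randn(8, 20, 1)                # toy batch of real-valued sequences
preds = model(torch.zeros(8, 1), targets)      # gradients flow through the fed-back predictions
loss = nn.functional.mse_loss(preds, targets)
loss.backward()
```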

Some related discussions:
Notes on Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Correct way to train without teacher forcing
Train transformer without teacher forcing

I have no practical experience with RNN architectures, but I can speak for Transformer architectures. Simply put, it is possible to train a good model without teacher forcing, and it may lead to some interesting results because the model, in a sense, learns from its own mistakes. However, it comes at a cost: convergence at the beginning of training will be poor, which leads to longer training times. This is why you should use scheduled sampling (scheduled teacher forcing), where the model is heavily teacher-forced at the beginning of training and much less so toward the end.

Transformers with scheduled sampling implementation (PyTorch discussions)
The Paper (Scheduled Sampling For Transformers)
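For illustration, here is a rough PyTorch sketch of the two-pass idea behind scheduled sampling for a transformer decoder. The helper names (`model(src, tgt_in)` returning per-position logits, `criterion` as a token-level cross-entropy) and the linear decay schedule are assumptions for the sketch, not the exact recipe from the paper, which mixes gold and predicted embeddings rather than tokens:

```python
import torch

def scheduled_sampling_step(model, src, tgt_in, tgt_out, step, total_steps, criterion):
    # Teacher-forcing probability decays linearly from 1.0 toward a floor as training proceeds.
    p_teacher = max(0.25, 1.0 - step / total_steps)

    # Pass 1: ordinary teacher-forced pass, just to get the model's own predictions.
    with torch.no_grad():
        first_logits = model(src, tgt_in)                      # (batch, seq_len, vocab)
        pred_tokens = first_logits.argmax(dim=-1)
        # Shift right so the predictions line up as decoder *inputs* (keep the BOS position).
        pred_in = torch.cat([tgt_in[:, :1], pred_tokens[:, :-1]], dim=1)

    # Per-position coin flip: keep the gold token or substitute the model's prediction.
    keep_gold = torch.rand(tgt_in.shape, device=tgt_in.device) < p_teacher
    mixed_in = torch.where(keep_gold, tgt_in, pred_in)

    # Pass 2: train on the mixed inputs, still against the original targets.
    logits = model(src, mixed_in)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    return loss
```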

Hopefully this helps.

There is also a tutorial on attention on the PyTorch tutorials page, and there you can see that teacher forcing is used 50% of the time (randomly). So yes, you can do it. In that tutorial a GRU is used, but you can use any language model (GPT, etc.). I would think that per batch/data pair the model learns more via teacher forcing, but I might circle back on this after looking into the links. I would think that yes, at the start it's best to use it, and later you can phase it out.
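A sketch of that coin-flip logic, in the spirit of the tutorial's decoder loop (the `decoder`, `decoder_input`, and `criterion` names here are placeholders rather than the tutorial's exact code):

```python
import random

teacher_forcing_ratio = 0.5  # the value used in the tutorial

def decode_with_random_teacher_forcing(decoder, decoder_input, decoder_hidden,
                                       target_tensor, criterion):
    # One coin flip per training pair decides whether this sequence is teacher-forced.
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    loss = 0.0
    for t in range(target_tensor.size(0)):
        output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        loss = loss + criterion(output, target_tensor[t])
        if use_teacher_forcing:
            decoder_input = target_tensor[t]                 # next input is the gold token
        else:
            decoder_input = output.argmax(dim=-1).detach()   # next input is the model's own guess
    return loss / target_tensor.size(0)
```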