Training a generative model without teacher forcing

I have no practical experience with RNN architectures, but I can speak for Transformer architectures. Simply put, it is possible to train a good model without teacher forcing, and it can even lead to interesting results because the model learns from its own mistakes. However, it comes at a cost: convergence at the start of training is poor, which means longer training time overall. That is why you should use scheduled sampling (scheduled teacher forcing), where the model is heavily teacher-forced at the beginning of training and progressively less so towards the end.

- Transformers with scheduled sampling implementation (PyTorch discussions)
- The paper: Scheduled Sampling for Transformers
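
For concreteness, here is a minimal sketch of the two-pass scheduled sampling idea described in the paper above, written for a generic PyTorch seq2seq Transformer. The names (`model`, `src`, `tgt`), the `model(src, tgt_in) -> logits` signature, and the linear decay schedule are my own assumptions, not code from the linked thread: the first pass runs with full teacher forcing to get the model's predictions, then a second pass trains on a mix of gold and predicted tokens, with the share of gold tokens decaying over training.

```python
import torch
import torch.nn.functional as F


def teacher_forcing_ratio(step, total_steps, start=1.0, end=0.1):
    """Linearly decay the probability of feeding gold tokens (assumed schedule)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac


def scheduled_sampling_step(model, src, tgt, step, total_steps, pad_id=0):
    # Assumes tgt = [BOS, y1, ..., yn, EOS]; model(src, tgt_in) returns
    # logits of shape (batch, tgt_len, vocab).
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]

    # Pass 1: plain teacher forcing, only to obtain the model's own predictions.
    with torch.no_grad():
        first_logits = model(src, tgt_in)
        predictions = first_logits.argmax(dim=-1)

    # The prediction at position t is the model's guess for the token at t+1,
    # so shift right by one to align it with the decoder input positions.
    shifted_preds = torch.cat([tgt_in[:, :1], predictions[:, :-1]], dim=1)

    # Mix gold and predicted tokens: each position keeps the gold token
    # with probability `ratio`, otherwise uses the model's own prediction.
    ratio = teacher_forcing_ratio(step, total_steps)
    keep_gold = torch.rand(tgt_in.shape, device=tgt_in.device) < ratio
    mixed_in = torch.where(keep_gold, tgt_in, shifted_preds)

    # Pass 2: decode conditioned on the mixed sequence and train on its outputs.
    logits = model(src, mixed_in)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_out.reshape(-1),
        ignore_index=pad_id,
    )
    return loss
```

Note that the second forward pass roughly doubles the cost of each training step; that is the price for exposing the model to its own predictions while keeping the parallel, non-autoregressive training that Transformers rely on.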

Hopefully this helps.
