Hello there,
Regarding training an RNN with LSTM cells for NLP, I am wondering whether exposure bias changes depending on how the teacher forcing scheme is applied.
Should the decision to use teacher forcing be made per sequence (per batch, if you will) or per decoder step? The first would mean that some input sequences are fully teacher forced, while other sequences are processed by feeding in only their own predictions. In the latter version, the random variable for teacher forcing sits inside the decoder loop, so every sequence is teacher forced at roughly a proportion x of its steps and fed its own predictions at the rest. A sketch of both variants is below.
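For concreteness, here is a minimal sketch of the two variants as I understand them. The toy decoder, the layer sizes, and names like `decode`, `teacher_forcing_ratio`, and `per_step` are my own placeholders for illustration, not taken from any particular library or from a real setup:

```python
import random
import torch
import torch.nn as nn

# Hypothetical toy decoder, only to make the two schemes concrete.
vocab_size, emb_dim, hidden_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
lstm_cell = nn.LSTMCell(emb_dim, hidden_dim)
to_logits = nn.Linear(hidden_dim, vocab_size)

def decode(targets, teacher_forcing_ratio=0.5, per_step=False):
    """targets: (batch, seq_len) of token ids, with targets[:, 0] = <sos>."""
    batch_size, seq_len = targets.shape
    h = torch.zeros(batch_size, hidden_dim)
    c = torch.zeros(batch_size, hidden_dim)

    # Variant 1 (per_step=False): one coin flip for the whole sequence,
    # so the sequence is either fully teacher forced or fully free-running.
    use_tf = random.random() < teacher_forcing_ratio

    inp = targets[:, 0]
    outputs = []
    for t in range(1, seq_len):
        h, c = lstm_cell(embed(inp), (h, c))
        logits = to_logits(h)
        outputs.append(logits)

        # Variant 2 (per_step=True): re-flip the coin at every decoder step,
        # so each sequence mixes teacher-forced and free-running steps.
        if per_step:
            use_tf = random.random() < teacher_forcing_ratio

        # Next input: ground-truth token or the model's own prediction.
        inp = targets[:, t] if use_tf else logits.argmax(dim=-1)

    return torch.stack(outputs, dim=1)  # (batch, seq_len - 1, vocab_size)
```

For example, `decode(torch.randint(0, vocab_size, (8, 20)), per_step=True)` would run the per-step variant on a random batch, while `per_step=False` gives the per-sequence variant.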
Any suggestions for the ratio? So far I have set it to 0.5, but before training at full scale I wanted to ask whether this is adequate or whether I should choose a lower ratio.
Thanks in advance!