Inference on a Sequence-2-Sequence model with teacher forcing

krishnab · May 19, 2021, 11:36pm

I was watching some very good videos by Aladdin Persson on Youtube, and he shows a simple Sequence-2-Sequence model for machine translation + Teacher Forcing. Now technically I adapted this model for time-series analysis, but the example is fine. The original code is below. The key issues is that due to Teacher Forcing, in the Seq2Seq layer, the forward() method takes both the input sentence and the label–meaning the correct answer.

My question is, in the case of actual inference on the model, I won’t have a label. During inference I will only have the input sentence. So when trying to run the model, the model function will expect model(input, label), and we won’t have any label to provide. So what is the way to deal with that?

Here is the code.

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_force_ratio=0.5):
        batch_size = source.shape[1]
        target_len = target.shape[0]
        target_vocab_size = len(english.vocab)

        outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)

        hidden, cell = self.encoder(source)

        # Grab the first input to the Decoder which will be <SOS> token
        x = target[0]

        for t in range(1, target_len):
            # Use previous hidden, cell as context from encoder at start
            output, hidden, cell = self.decoder(x, hidden, cell)

            # Store next output prediction
            outputs[t] = output

            # Get the best word the Decoder predicted (index in the vocabulary)
            best_guess = output.argmax(1)

            # With probability of teacher_force_ratio we take the actual next word
            # otherwise we take the word that the Decoder predicted it to be.
            # Teacher Forcing is used so that the model gets used to seeing
            # similar inputs at training and testing time, if teacher forcing is 1
            # then inputs at test time might be completely different than what the
            # network is used to. This was a long comment.
            x = target[t] if random.random() < teacher_force_ratio else best_guess

        return outputs

As you can see, the forward() function takes a source, target, where the source is the input sentence and the target is the actually translated sentence. I have to use the model as below.

model = Seq2Seq(encoder_net, decoder_net).to(device)
prediction = model(data, label)

Can anyone explain how to do inference on a Sequence-to-Sequence model, or if there is a better way to train or write these models to deal with teacher forcing, etc. Thanks.

Kushaj · May 20, 2021, 4:18am

During inference, you use the encoder as normal. For decoder, you pass the input (output of encoder) and then the output of decoder is used as input for next timestamp (and repeat).

So in your seq2seq imagine an arrow going from decoder output at step t to decoder input at step t+1.

pascal_notsawo · May 20, 2021, 6:51am

Adding to what @Kushaj says, special symbols are usually added to the sequence when forming them, in most cases: <BOS> target_senquence <EOS>

Now in production, once we have the encoder output for the input sequence so we want to predict the output (encoder_output), the first thing we give to the decoder is: source = encoder_output and target = <BOS>

With that it will predict the first token (token_1) of the output sequence, and the next thing we give to the decoder is: source = encoder_output and target = <BOS> token_1.
It will output a second token, and so on until it outputs <EOS>, and the decoding process stops: the sequence between <BOS> and <EOS> will be our prediction.

In practice, to predict each token generally, we don’t take only one token that maximizes the probability at the output of the model, but k (top_k) tokens that maximize it: thus several possible outputs are exploited before choosing the one with the maximum probability.
The idea of beam search here is that, it is not necessarily because the current token produced by the model has maximized the output probability, that it will allow to have the final output sequence that maximizes the probability: several paths are thus exploited to hope to be able to make the good choice.