Length of target tensor in seq2seq example

It might very well be that I am misunderstanding something here, but I am following the official seq2seq tutorial and I am unsure about the following section:

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

I assume that all input and output sequences start with the SOS token and end with the EOS token, both on the source and the target side.

As you can see, the SOS token is given as input and then target_length new tokens are generated and used to calculate the loss. It seems to me that this means one token too many will be generated (and used in the loss) because the first token, SOS, has already been given. This also means that the loss function compares the wrong indices: I think it should be target_tensor[di+1], because otherwise you are one token late, since everything is shifted by one position once you start with SOS.

Am I wrong? I wasn’t sure whether this should be posted as an issue on GitHub or here, so I am trying here first.

I currently don’t understand your question.
Are you confused about the length of the output, or about the output indexing, given that the SOS token is part of the input?

On the output side, we start the decoder off with the SOS token, so it already has one token. But then we still run for di in range(target_length), so on top of the SOS token we still predict target_length tokens. That means we predict one more token than there are in the actual output.

Maybe it’s clearer with the most basic example of predicting a target (e.g., translating “stopp” (German) to “stop”). In this case your target_tensor = ['stop', '<EOS>'] and target_length = 2. So this gives you two iterations:

  1. decoder_hidden_1 + <SOS> ==> decoder_hidden_2 + "stop"
  2. decoder_hidden_2 + "stop" ==> decoder_hidden_3 + <EOS>

Assuming that all predictions are correct or that teacher_forcing = True, I think the number of iterations works out just fine. Where do you think it fails with respect to this example?
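
To make this concrete, here is a minimal stand-alone trace of the teacher-forcing loop on that example. It is only a sketch: the token indices are made up for illustration, and the actual decoder call is replaced by a print.

    SOS_token, EOS_token, STOP_token = 0, 1, 2   # illustrative indices only

    target_tensor = [STOP_token, EOS_token]      # ["stop", "<EOS>"]
    target_length = len(target_tensor)           # 2

    decoder_input = SOS_token                    # decoding is kicked off with SOS
    for di in range(target_length):
        # In the tutorial, decoder(decoder_input, ...) predicts the next token here,
        # and that prediction is scored against target_tensor[di].
        print(f"step {di}: input={decoder_input} -> target={target_tensor[di]}")
        decoder_input = target_tensor[di]        # teacher forcing: feed the gold token

    # Prints:
    # step 0: input=0 -> target=2   (SOS    -> "stop")
    # step 1: input=2 -> target=1   ("stop" -> <EOS>)

Note that the input at step di is always one position behind target_tensor[di], so indexing with di (rather than di+1) already compares each prediction against the next token.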

I think my issue then lies with preprocessing. I thought both input and output sequences had to start with SOS and end with EOS, but from what you are saying it seems that the input should contain SOS and EOS while the output should contain only EOS?

EDIT: After going through the tutorial again, I found that it indeed only adds EOS tokens in preprocessing, and SOS is only added in the decoder. Coming from language models, I had always assumed that SOS would also be added to the input.
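
For reference, the preprocessing in the tutorial looks roughly like this (a sketch from memory, assuming the tutorial's Lang object with its word2index dictionary and EOS_token = 1): only EOS is appended, and SOS never appears in the data tensors.

    import torch

    EOS_token = 1                          # as defined in the tutorial
    device = torch.device("cpu")           # or cuda, as in the tutorial

    def tensorFromSentence(lang, sentence):
        # lang.word2index maps words to indices; only EOS is appended at the end.
        indexes = [lang.word2index[word] for word in sentence.split(' ')]
        indexes.append(EOS_token)
        return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)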

Yes, only the EOS token is really important, so that the classifier is able to predict the end of a sentence. The SOS token is only used to kick off the decoding, and therefore does not need to be added to the sequences.
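
In other words, the only place SOS shows up is when the decoder input is initialized inside the train/evaluate loop, roughly like this (sketch; SOS_token = 0 and device as defined in the tutorial):

    import torch

    SOS_token = 0                          # as defined in the tutorial
    device = torch.device("cpu")

    # SOS only exists as the very first decoder input; it is never appended
    # to the source or target tensors.
    decoder_input = torch.tensor([[SOS_token]], device=device)
    # decoder_hidden is then initialized from the encoder's final hidden state.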

Thanks Chris (again)!