Adding to what @Kushaj says, special tokens are usually added when forming the target sequence, in most cases as: `<BOS> target_sequence <EOS>`.
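For example, here is a tiny sketch of that formatting step (the token strings and the `make_target` helper are just illustrative, not tied to any particular library):

```python
BOS, EOS = "<BOS>", "<EOS>"  # illustrative special tokens

def make_target(tokens):
    # Wrap the raw target tokens with the start/end markers.
    return [BOS] + tokens + [EOS]

make_target(["le", "chat", "dort"])
# -> ['<BOS>', 'le', 'chat', 'dort', '<EOS>']
```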
Now, in production, once we have the encoder output for the input sequence whose output we want to predict (`encoder_output`), the first thing we give to the decoder is `source = encoder_output` and `target = <BOS>`.
With that it will predict the first token of the output sequence (`token_1`), and the next thing we give to the decoder is `source = encoder_output` and `target = <BOS> token_1`.
It will output a second token, and so on until it outputs `<EOS>`, at which point the decoding process stops: the sequence between `<BOS>` and `<EOS>` is our prediction.
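Here is a minimal sketch of that greedy decoding loop; `decoder`, `encoder_output`, and the token IDs are stand-ins for whatever your model actually uses, so adapt the call signature to your own code:

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, encoder_output, bos_id, eos_id, max_len=50):
    """Autoregressive greedy decoding: feed the growing target back in.

    `decoder(encoder_output, target_ids)` is assumed to return logits of
    shape (batch, target_len, vocab_size).
    """
    target = [bos_id]
    for _ in range(max_len):
        target_ids = torch.tensor(target).unsqueeze(0)   # (1, target_len)
        logits = decoder(encoder_output, target_ids)[0]  # (target_len, vocab)
        next_token = int(logits[-1].argmax())            # greedy choice
        target.append(next_token)
        if next_token == eos_id:
            break
    # The prediction is what sits between <BOS> and <EOS>.
    return target[1:-1] if target[-1] == eos_id else target[1:]
```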
In practice, to predict each token we generally do not keep only the single token that maximizes the probability at the output of the model, but the `k` (`top_k`) tokens that maximize it: several possible outputs are thus explored before choosing the one with the maximum probability.
The idea of beam search here is that the token that maximizes the output probability at the current step does not necessarily lead to the final output sequence that maximizes the overall probability: several paths are therefore explored in the hope of making the right choice.
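A rough sketch of beam search under the same assumed `decoder` interface (log-probabilities are summed so whole paths are compared, not single steps; length normalization is deliberately left out to keep it simple):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(decoder, encoder_output, bos_id, eos_id, k=3, max_len=50):
    """Keep the k best partial sequences (beams) at every step."""
    # Each beam is (token_ids, cumulative log-probability).
    beams = [([bos_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))  # this path is complete
                continue
            target_ids = torch.tensor(tokens).unsqueeze(0)
            logits = decoder(encoder_output, target_ids)[0]  # (len, vocab)
            log_probs = F.log_softmax(logits[-1], dim=-1)
            top = torch.topk(log_probs, k)                   # k best next tokens
            for logp, tok in zip(top.values, top.indices):
                candidates.append((tokens + [int(tok)], score + float(logp)))
        if not candidates:
            break  # every beam has produced <EOS>
        # Keep only the k highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    finished.extend(beams)
    best_tokens, _ = max(finished, key=lambda b: b[1])
    return best_tokens
```

With `k = 1` this degenerates to the greedy loop above; larger `k` trades compute for a better chance of finding the sequence with the highest overall probability.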