In the encoder, I pack the padded batch like this:
packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
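For reference, here is a minimal self-contained sketch of that packing step (the shapes and names here are illustrative, not my real model): `embedded` is (max_len, batch, emb_dim) with sequences sorted by descending length, as `pack_padded_sequence` expects by default.

```python
import torch

# Illustrative padded batch: max_len=5, batch=3, emb_dim=4,
# with true lengths 5, 3, 2 (sorted descending, zero-padded).
embedded = torch.zeros(5, 3, 4)
input_lengths = torch.tensor([5, 3, 2])

packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)

# The packed data contains only the real timesteps (5 + 3 + 2 = 10),
# so the encoder RNN never runs over PAD positions.
print(packed.data.shape[0])  # 10
```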
In the decoder, during training, I generate outputs one step at a time up to MAX_LENGTH, and then apply a masked loss like this:
crossEntropy = -torch.log(torch.gather(input, 1, target.view(-1, 1)).squeeze(1))
loss = crossEntropy.masked_select(mask).mean()
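To make the masked loss concrete, here is a tiny self-contained example (the tensors are made up just to show the shapes): `probs` holds per-token softmax outputs of shape (batch, vocab), `target` holds gold token ids, and `mask` is True at real positions and False at PAD positions.

```python
import torch

# Softmax outputs for a batch of 2 steps over a vocab of size 3.
probs = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
target = torch.tensor([0, 1])          # gold token ids
mask = torch.tensor([True, True])      # False at PAD positions

# Gather the probability of each gold token, take -log,
# then average only over the unmasked (non-PAD) positions.
crossEntropy = -torch.log(torch.gather(probs, 1, target.view(-1, 1)).squeeze(1))
loss = crossEntropy.masked_select(mask).mean()
print(loss.item())  # mean of -log(0.7) and -log(0.8)
```

Note the `.squeeze(1)`: without it, `crossEntropy` is (batch, 1) while `mask` is (batch,), and `masked_select` broadcasts them to (batch, batch), silently selecting the wrong elements.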
When sampling, I give an input sentence and generate the output sentence with beam search.
Why do I still generate 'PAD' tokens before EOS? PS: I pad the training data after the EOS token.
Is this normal?