Batched seq2seq in pytorch

The implementation looks incorrect to me. The main problem seems to be that the loss always treats the padded target sequence as the correct one and tries to learn to predict the padding token (which, in the best case, seems not very useful); this may be related to the behavior you observed during testing.
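
For illustration, here is a small example (with made-up shapes and an assumed PAD index of 0, not taken from your code) of how a loss computed over the full padded batch also asks the model to predict PAD at the padded steps:

```python
import torch
import torch.nn as nn

PAD = 0  # assumed padding index

# toy decoder outputs: (batch, seq_len, vocab) and padded targets
logits = torch.randn(2, 5, 10)
targets = torch.tensor([[4, 2, 7, PAD, PAD],
                        [3, 9, 1, 5, PAD]])

# averaging over every position also penalizes the model
# for not predicting PAD at the padded time steps
criterion = nn.CrossEntropyLoss()
loss = criterion(logits.view(-1, 10), targets.view(-1))
```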

Personally, I think the problem can be tackled by optimizing a masked loss. PyTorch doesn't seem to have a native one yet, so it may require some workaround. This discussion seems related: How can i compute seq2seq loss using mask?
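
A minimal sketch of such a masked loss, assuming the same PAD index and tensor shapes as above (names here are illustrative, not from your code):

```python
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, pad_idx=0):
    """Cross-entropy averaged only over non-PAD target positions."""
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    vocab = logits.size(-1)
    # per-position losses, no reduction yet
    losses = F.cross_entropy(logits.view(-1, vocab),
                             targets.view(-1),
                             reduction='none')
    # 1.0 for real tokens, 0.0 for padding
    mask = (targets.view(-1) != pad_idx).float()
    return (losses * mask).sum() / mask.sum()
```

This way the padded positions contribute nothing to the gradient, and the average is taken only over real tokens.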