Using cross_entropy's ignore_index instead of pack_padded_sequence

The loss computed at some position, call it t, is ignored only if the true label at t equals the specified ignore_index. So if you set the true labels at the padding positions to ignore_index, the predictions at those positions are excluded from the loss, but it is up to you to line the two up correctly. There is no magic connecting the two concepts: nn.CrossEntropyLoss and its ignore_index know nothing about padding.
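Here is a minimal sketch of what I mean, on a made-up toy batch. The shapes, the `PAD_LABEL` name and the label values are my own assumptions for illustration, not something from your model. It checks that passing `ignore_index` gives the same loss as manually dropping the padded positions before computing the loss.

```python
import torch
import torch.nn.functional as F

# Hypothetical padding label; -100 also happens to be the default ignore_index.
PAD_LABEL = -100
NUM_CLASSES = 5

torch.manual_seed(0)
# Toy logits: (batch=2, time=4, classes=5).
logits = torch.randn(2, 4, NUM_CLASSES)
# The second sequence has only 2 real tokens; its last 2 labels are padding.
targets = torch.tensor([[1, 2, 3, 4],
                        [0, 2, PAD_LABEL, PAD_LABEL]])

# cross_entropy expects (N, C) logits and (N,) class-index targets.
flat_logits = logits.reshape(-1, NUM_CLASSES)
flat_targets = targets.reshape(-1)

# Positions whose label equals ignore_index contribute nothing to the loss,
# and the mean is taken over the remaining positions only.
loss_ignore = F.cross_entropy(flat_logits, flat_targets, ignore_index=PAD_LABEL)

# Same thing done by hand: keep only the non-padding positions.
mask = flat_targets != PAD_LABEL
loss_manual = F.cross_entropy(flat_logits[mask], flat_targets[mask])

print(torch.allclose(loss_ignore, loss_manual))
```

Note that the loss only looks at the labels: if a padded position ends up with a real class index as its label, it will be counted like any other position.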

It’s possible that I misunderstood you (there are many different kinds of language modeling). What are your inputs and outputs?