Right vs Left Padding

According to my understanding, GPT-2 (like other decoder-only base models) is a causal (left-to-right) language model: at generation time it predicts the next token from the logits at the final position (logits[:, -1]).

Right-padding (problematic for GPT-2-style models):
Sentence A: [The, cat, sat, PAD, PAD]
Sentence B: [The, dog, ran, very, fast]
→ GPT-2 predicts from the final position.

For Sentence B → correct (after “fast”)
For Sentence A → wrong, because the last positions are just PAD tokens

So for right-padded sequences, GPT-2 ends up conditioning its next-token prediction on fake padding context, which leads to bad generation despite low training loss (exactly the issue you’re seeing).
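A minimal pure-Python sketch of the problem (the PAD string and the toy batch are hypothetical stand-ins for real token IDs): with right-padding, indexing the last position of each row lands on padding for the shorter sequence, and you would need the attention mask to recover the true last-token index.

```python
PAD = "PAD"  # hypothetical pad token, standing in for a real pad token ID

batch = [
    ["The", "cat", "sat", PAD, PAD],        # Sentence A, right-padded
    ["The", "dog", "ran", "very", "fast"],  # Sentence B, full length
]

# What logits[:, -1] effectively conditions on: the final position of each row.
last_positions = [seq[-1] for seq in batch]
# Sentence A's final position is PAD, so the model would predict "after padding".

# Recovering the true last-token index requires the attention mask:
attention_mask = [[1 if tok != PAD else 0 for tok in seq] for seq in batch]
true_last = [sum(mask) - 1 for mask in attention_mask]
# true_last differs per row (2 for A, 4 for B), so a single logits[:, -1]
# cannot be correct for both sequences at once.
```

This per-row gather is exactly the extra bookkeeping that left-padding avoids.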

Left-padding (correct for batching with GPT-2):
Sentence A: [PAD, PAD, The, cat, sat]
Sentence B: [The, dog, ran, very, fast]

Now both sequences end at the same index, so GPT-2’s logits[:, -1] always corresponds to the true last token. (An attention mask is still needed so the model ignores the PAD positions on the left.)
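The same sketch with left-padding (again using a hypothetical PAD stand-in) shows that the last position of every row is now a real token, so one shared index works for the whole batch:

```python
PAD = "PAD"  # hypothetical pad token, standing in for a real pad token ID

batch = [
    [PAD, PAD, "The", "cat", "sat"],        # Sentence A, left-padded
    ["The", "dog", "ran", "very", "fast"],  # Sentence B, full length
]

# The final position of each row is now always a real token,
# so logits[:, -1] is valid for every sequence in the batch.
last_positions = [seq[-1] for seq in batch]
```

In Hugging Face transformers this corresponds to setting `tokenizer.padding_side = "left"` before batching for generation (and, since GPT-2 has no pad token, typically `tokenizer.pad_token = tokenizer.eos_token`), while still passing the attention_mask to `generate`.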