Right vs Left Padding

While fine-tuning a decoder-only LLM like LLaMA on a chat dataset, what kind of padding should one use?
Many papers use left padding, but is right padding actually wrong? transformers gives the following warning when right padding is used: "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer."
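
For reference, a minimal sketch of the setup the warning refers to. The checkpoint name is just an illustrative placeholder, and reusing the EOS token as the pad token is a common convention for LLaMA-style tokenizers, not something the warning itself prescribes:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint name; any LLaMA-style tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    padding_side="left",  # avoids the right-padding warning at generation time
)
# LLaMA tokenizers ship without a pad token; reusing EOS is a common choice.
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["Hello!", "A somewhat longer prompt for the model."],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"])       # the shorter sequence is padded on the left
print(batch["attention_mask"])  # zeros mark the pad positions
```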


I would simply assume the model was pretrained with left-padded input, so it seems reasonable to adhere to it for fine-tuning.

Left padding also feels more intuitive 🙂

Thanks for the reply. Foundation models like LLaMA are usually pretrained without padding; padding is mostly used during fine-tuning. Also, could you explain why left padding feels more intuitive? Thanks!

I’m concerned about this as well. The generate() method in the transformers library explicitly suggests that decoder-only models should use left padding, and I would also like to know the reason for this.


I asked the same question on the Artificial Intelligence Stack Exchange and got a good answer:
transformer - While fine-tuning a decoder only LLM like LLaMA on chat dataset, what kind of padding should one use? - Artificial Intelligence Stack Exchange
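
In short, generate() appends new tokens after the last position of the padded batch, so with right padding the continuation would be placed after a run of pad tokens instead of right after the prompt; left padding keeps each prompt's final real token at the end of its row. A minimal sketch of batched generation with left padding (gpt2 is used here purely as a small decoder-only stand-in):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small decoder-only model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(name)

prompts = ["The capital of France is", "Hi"]
batch = tokenizer(prompts, padding=True, return_tensors="pt")

# With left padding, the last column of input_ids holds the final *real* token
# of every prompt, so generation continues from the right place for each row.
out = model.generate(**batch, max_new_tokens=5, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```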

I have been thinking about this for quite some time. How do they manage to train without padding if the batches contain variable-length sequences? Or do they just chunk the whole text corpus into segments of equal length and train on those? In that setup, there is no longer a problem of variable input length. However, is that a correct approach? What if semantically related sentences end up in different segments?

It is the latter: for pre-training LLMs, they collect a large corpus and, as you said, divide it into segments of equal length. This can bundle two unrelated paragraphs together, but because the models are trained on trillions of tokens it doesn’t affect the model much. There is also the option of masking out the unrelated text so segments do not attend to each other.
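
For anyone curious what that chunking looks like in practice, here is a hedged sketch in the style of the group_texts helper from the Hugging Face language-modeling examples (the block_size value and field names are assumptions, not taken from this thread):

```python
from itertools import chain

def group_texts(examples, block_size=2048):
    # Concatenate every tokenized document in the batch into one long stream...
    concatenated = list(chain.from_iterable(examples["input_ids"]))
    # ...drop the ragged tail so the length is a multiple of block_size...
    total_length = (len(concatenated) // block_size) * block_size
    # ...and slice it into fixed-size blocks. Every example now has the same
    # length, which is why pre-training can avoid padding entirely.
    input_ids = [
        concatenated[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    return {"input_ids": input_ids, "labels": [ids.copy() for ids in input_ids]}
```

Applied with datasets.map(..., batched=True) after tokenization, this turns variable-length documents into uniform blocks, so the left-vs-right padding question never comes up during pre-training.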
