Hi. I usually write a custom collate_fn
and pass it as an argument when defining my DataLoader. It usually looks something like:
def collate_fn(batch):
    max_len = max(len(b['input_ids']) for b in batch)
    input_ids = [b['input_ids'] + [0] * (max_len - len(b['input_ids'])) for b in batch]
    labels = [b['label'] for b in batch]
    return input_ids, labels
As you can see, I’m hard-coding 0
as my padding token ID. What I’m wondering is: since different language models and their tokenizers use different IDs for their padding tokens, is there a way to make the collate_fn
flexible enough to take that into account?
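To illustrate what I mean, here is a rough sketch of the kind of thing I have in mind, using functools.partial to bind a pad_token_id argument before handing the function to DataLoader (the pad_token_id attribute here is an assumption based on Hugging Face-style tokenizers):

```python
from functools import partial

def collate_fn(batch, pad_token_id=0):
    # Pad every sequence in the batch to the length of the longest one,
    # using the tokenizer-specific pad ID instead of a hard-coded 0.
    max_len = max(len(b['input_ids']) for b in batch)
    input_ids = [
        b['input_ids'] + [pad_token_id] * (max_len - len(b['input_ids']))
        for b in batch
    ]
    labels = [b['label'] for b in batch]
    return input_ids, labels

# Bind the pad ID once, then pass the result as the collate_fn, e.g.:
# loader = DataLoader(
#     dataset,
#     batch_size=8,
#     collate_fn=partial(collate_fn, pad_token_id=tokenizer.pad_token_id),
# )
```

Is something along these lines the idiomatic approach, or is there a built-in way to do this?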