Hi. I usually write a custom collate_fn
and pass it as an argument when defining my DataLoader. It usually looks something like:
def collate_fn(batch):
    max_len = max(len(b['input_ids']) for b in batch)
    input_ids = [b['input_ids'] + [0] * (max_len - len(b['input_ids'])) for b in batch]
    labels = [b['label'] for b in batch]
    return input_ids, labels
As you can see, I’m hard-coding 0
as my padding token ID. What I’m wondering is: since different language models and their tokenizers use different IDs for their padding tokens, is there a way to make the collate_fn
flexible enough to take that into account?
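To illustrate what I mean, here is a rough sketch of the kind of thing I have in mind, using functools.partial to bind a pad_token_id argument before handing the function to DataLoader (the pad_token_id attribute here is an assumption based on Hugging Face-style tokenizers):

```python
from functools import partial

def collate_fn(batch, pad_token_id=0):
    # Pad every sequence in the batch to the length of the longest one,
    # using the tokenizer-specific pad ID instead of a hard-coded 0.
    max_len = max(len(b['input_ids']) for b in batch)
    input_ids = [
        b['input_ids'] + [pad_token_id] * (max_len - len(b['input_ids']))
        for b in batch
    ]
    labels = [b['label'] for b in batch]
    return input_ids, labels

# Bind the pad ID once, then pass the result as the collate_fn, e.g.:
# loader = DataLoader(
#     dataset,
#     batch_size=8,
#     collate_fn=partial(collate_fn, pad_token_id=tokenizer.pad_token_id),
# )
```

Is something along these lines the idiomatic approach, or is there a built-in way to do this?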