Problem: While fine-tuning BERT on the GLUE MRPC dataset, I ran into the following error:
RuntimeError: stack expects each tensor to be equal size, but got [100] at entry 0 and [89] at entry 1
This is my code:
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"],
                     example["sentence2"],
                     padding=True,
                     truncation=True,
                     return_tensors='pt')
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8,
)
# the code snippet that raised the error
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
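After some digging, I suspect the cause is that padding=True inside tokenize_function pads each chunk that map() hands to the tokenizer to that chunk's own longest sequence, so examples from different chunks end up with different lengths (here 100 vs. 89), and the DataLoader's default collate_fn then fails when it tries to stack them. Based on the Hugging Face course, I put together the sketch below, which tokenizes without padding and instead pads each DataLoader batch dynamically with DataCollatorWithPadding. I haven't fully verified it, so please correct me if this is off:

from transformers import DataCollatorWithPadding

# Tokenize without padding; padding happens per batch at collate time instead.
def tokenize_function(example):
    return tokenizer(example["sentence1"],
                     example["sentence2"],
                     truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# DataCollatorWithPadding pads every batch to the longest sequence in that batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8,
    collate_fn=data_collator,
)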
I'm wondering whether my understanding of the cause is correct, and whether dynamic padding is the right way to fix it. Any guidance or insights would be greatly appreciated!