Hello all
I’m currently working on a project using BERT (Bidirectional Encoder Representations from Transformers). The model is set up for binary classification, i.e. each instance is assigned to one of two classes: for idiom recognition, it is trained to classify each instance as either an idiom or not an idiom.
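For context, the model is a pretrained BERT with a two-way classification head on top; it's loaded roughly like this (simplified from the notebook, so the exact call may differ slightly):

from transformers import BertForSequenceClassification
# pretrained BERT with a 2-class head (a Linear(768, 2) layer on top of the pooled output)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)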
EPIE Corpus dataset: [2006.09479] EPIE Dataset: A Corpus For Possible Idiomatic Expressions
This dataset contains possible idiomatic expression instances from 717 idioms, divided into two folders:
Formal Idioms - idioms which undergo lexical changes.
Static Idioms - idioms which stay the same across instances.
Each folder contains 3 sentence-aligned files, with ‘*’ replaced by either ‘Static_Idioms’ or ‘Formal_Idioms’:
*_Words.txt :- Original sentences
*_Candidates.txt :- Candidate idiom whose instance is present in the corresponding sentence
*_Tags.txt :- Sequence labelling tags for each token of the sentence
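Since the files are sentence-aligned, line i of each file refers to the same sentence; reading them side by side looks roughly like this (file names assumed for the Formal Idioms folder):

# line i of Words/Candidates/Tags all describe the same sentence
with open("Formal_Idioms_Words.txt") as words, \
     open("Formal_Idioms_Candidates.txt") as cands, \
     open("Formal_Idioms_Tags.txt") as tags:
    for sentence, candidate, tag_seq in zip(words, cands, tags):
        print(sentence.strip(), "|", candidate.strip(), "|", tag_seq.strip())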
Here’s my problem. I’m getting this error when trying to train the BERT model:
188 def __init__(self, *tensors: Tensor) -> None:
--> 189 assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), "Size mismatch between tensors"
190 self.tensors = tensors
191
AssertionError: Size mismatch between tensors
I understand that this error indicates a size mismatch between the input tensors: the first dimension of every tensor passed to TensorDataset should be the same, but it isn’t. I’ve checked the model’s classification head: Linear(in_features=768, out_features=2, bias=True)
and the shapes:
print(input_ids.shape)
print(attention_masks.shape)
print(labels.shape)
which would output:
torch.Size([1, 512])
torch.Size([1, 512])
torch.Size([3136, 1])
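Just to check my understanding of the assertion, a minimal reproduction with tensors of those same shapes fails in the same way:

import torch
from torch.utils.data import TensorDataset

a = torch.zeros(1, 512, dtype=torch.long)     # same shape as my input_ids
b = torch.zeros(1, 512, dtype=torch.long)     # same shape as my attention_masks
c = torch.zeros(3136, 1, dtype=torch.long)    # same shape as my labels
TensorDataset(a, b, c)  # raises AssertionError: Size mismatch between tensors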
I’m using a DataLoader to create batches of input data; however, I’m now having doubts about whether I’m correctly tokenizing the input sequences or whether I need to resize the tensors (and if that’s the case, I don’t know how to go about it).
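One thing I’ve considered (but haven’t tried yet, so this is just a sketch) is encoding each sentence of the *_Words.txt files separately instead of the whole file at once, so that input_ids has one row per sentence and its first dimension matches the number of labels:

# hypothetical: `sentences` is a list with one sentence per label (one line of *_Words.txt each)
encoded = tokenizer(
    sentences,
    add_special_tokens=True,
    max_length=max_length,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',
)
input_ids = encoded['input_ids']              # shape: [num_sentences, max_length]
attention_masks = encoded['attention_mask']   # shape: [num_sentences, max_length]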
Here are the two sections of code I wanted to highlight, to see if anyone can catch something funky:
import os
import torch
from transformers import BertTokenizer

corpus_path = "Formal_Idioms_Corpus"
corpus_files = [os.path.join(corpus_path, f) for f in os.listdir(corpus_path) if f.endswith(".txt")]

# read every corpus file into memory
corpus_texts = []
for file_path in corpus_files:
    with open(file_path, "r") as f:
        corpus_texts.append(f.read())

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_length = 512

# tokenize input sequences (each *_Words.txt file is read as a single string)
tokenized_texts = []
for file_path in corpus_files:
    if file_path.endswith("_Words.txt"):
        with open(file_path, "r") as f:
            text = f.read()
        tokenized_text = tokenizer.tokenize(text)[:max_length]
        tokenized_texts.append(tokenized_text)
print(tokenized_texts)

# pad and truncate sequences
input_ids = []
attention_masks = []
for text in tokenized_texts:
    encoded_dict = tokenizer.encode_plus(text, add_special_tokens=True, max_length=max_length,
                                         pad_to_max_length=True, return_attention_mask=True)
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = torch.tensor(input_ids, dtype=torch.long)
attention_masks = torch.tensor(attention_masks, dtype=torch.long)
and this is the training section:
from torch.utils.data import TensorDataset, DataLoader
from torch.optim import AdamW
import torch.nn as nn

# set the batch size and create a PyTorch DataLoader
# (model, labels, device and num_epochs are defined earlier in the notebook, not shown here)
batch_size = 16
data = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(data, batch_size=batch_size, shuffle=True)

# set the optimizer and loss function
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
loss_fn = nn.CrossEntropyLoss()

# train the model
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids_batch = batch[0].to(device)
        attention_masks_batch = batch[1].to(device)
        labels_batch = batch[2].squeeze().to(device)  # squeeze the labels tensor from [batch, 1] to [batch]
        outputs = model(input_ids_batch, attention_mask=attention_masks_batch)
        logits = outputs[0]
        print('logits shape:', logits.shape)
        print('labels_batch shape:', labels_batch.shape)
        loss = loss_fn(logits, labels_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Thanks for taking the time to read this. This is my first time using BERT as well as working with a corpus. I also wouldn’t mind sharing the Google Colab notebook on GitHub if anyone wants to help out!