Size mismatch between tensors - Using BERT model for binary classification

Hello all :slight_smile:

I’m currently working on a project using BERT (Bidirectional Encoder Representations from Transformers). The model performs binary classification, i.e. each instance is assigned to one of two possible classes. In the case of idiom recognition, the model is trained to classify each instance as either an idiom or not an idiom.
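For context, the setup looks roughly like this (a simplified sketch; I’m assuming bert-base-uncased and Hugging Face’s BertForSequenceClassification here):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# two output classes: idiom vs. not an idiom
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)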

EPIE Corpus dataset: [2006.09479] EPIE Dataset: A Corpus For Possible Idiomatic Expressions

This dataset contains possible idiomatic expression instances from 717 idioms, divided into two folders:

Formal Idioms - idioms which undergo lexical changes.
Static Idioms - idioms which stay the same across instances.

Each folder contains three sentence-aligned files, with '*' replaced by either 'Static_Idioms' or 'Formal_Idioms':

*_Words.txt - original sentences.
*_Candidates.txt - the candidate idiom whose instance is present in the corresponding sentence.
*_Tags.txt - sequence labelling tags for each token of the sentence.
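For anyone unfamiliar with the layout, reading the three aligned files of one folder together looks roughly like this (just a sketch; the exact file names are assumed from the naming scheme above):

import os

corpus_path = "Formal_Idioms_Corpus"

def read_lines(filename):
    with open(os.path.join(corpus_path, filename), "r") as f:
        return f.read().splitlines()

sentences = read_lines("Formal_Idioms_Words.txt")
candidates = read_lines("Formal_Idioms_Candidates.txt")
tags = read_lines("Formal_Idioms_Tags.txt")

# the files are sentence-aligned: line i of each file describes the same instance
assert len(sentences) == len(candidates) == len(tags)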

Here’s my problem. I’m getting this error when trying to train the BERT model:

    188     def __init__(self, *tensors: Tensor) -> None:
--> 189         assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), "Size mismatch between tensors"
    190         self.tensors = tensors

AssertionError: Size mismatch between tensors

I understand that this error message indicates a size mismatch between the input tensors: the first dimension of all of them should be the same, but it isn’t. I’ve checked the model’s output layer: Linear(in_features=768, out_features=2, bias=True)

and printed the shapes of the tensors I’m passing to TensorDataset, which would output:

torch.Size([1, 512])
torch.Size([1, 512])
torch.Size([3136, 1])
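To make sure I understand the check, here is a toy example of what TensorDataset asserts (not my real data, just tensors with shapes like mine):

import torch
from torch.utils.data import TensorDataset

a = torch.zeros(1, 512)
b = torch.zeros(1, 512)
c = torch.zeros(3136, 1)

TensorDataset(a, b)     # fine: both tensors have size 1 in dim 0
TensorDataset(a, b, c)  # raises AssertionError: Size mismatch between tensors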

I’m using a DataLoader to create batches of input data; however, I’m now having doubts about whether I’m tokenizing the input sequences correctly or whether I need to resize the tensors (and if that’s the case, I don’t know how to go about it).

Here are the two sections of code I wanted to highlight, in case anyone can spot something funky:

corpus_path = "Formal_Idioms_Corpus"
corpus_files = [os.path.join(corpus_path, f) for f in os.listdir(corpus_path) if f.endswith(".txt")]

corpus_texts = []
for file_path in corpus_files:
    with open(file_path, "r") as f:
        corpus_texts.append(f.read())  # read the raw text of each corpus file

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

max_length = 512

# tokenize input sequences
tokenized_texts = []
for file_path in corpus_files:
    if file_path.endswith("_Words.txt"):
        with open(file_path, "r") as f:
            text = f.read()
        tokenized_text = tokenizer.tokenize(text)[:max_length]
        tokenized_texts.append(tokenized_text)  # collect the tokenized text for this file


# pad and truncate sequences
input_ids = []
attention_masks = []
for text in tokenized_texts:
    encoded_dict = tokenizer.encode_plus(text, add_special_tokens=True, max_length=max_length, pad_to_max_length=True, return_attention_mask=True)
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = torch.tensor(input_ids, dtype=torch.long)
attention_masks = torch.tensor(attention_masks, dtype=torch.long)

and this is the training section:

from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW
import torch.nn as nn

# set the batch size and create a PyTorch DataLoader
batch_size = 16

data = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(data, batch_size=batch_size, shuffle=True)

# set the optimizer and loss function
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
loss_fn = nn.CrossEntropyLoss()

# train the model
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids_batch = batch[0].to(device)
        attention_masks_batch = batch[1].to(device)
        labels_batch = batch[2].squeeze().to(device) # make sure to squeeze the labels tensor
        outputs = model(input_ids_batch, attention_masks_batch)
        logits = outputs[0]
        print('logits shape:', logits.shape)
        print('labels_batch shape:', labels_batch.shape)
        loss = loss_fn(logits, labels_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Thanks for taking the time to read this; it’s my first time using the BERT model as well as working with a corpus. I also wouldn’t mind sharing the Google Colab notebook on GitHub if anyone wants to help out! :star2:

I assume you are seeing the error when the forward pass is called in:

outputs = model(input_ids_batch, attention_masks_batch)

If so, then note that the batch size seems to be 1 in your previous comment:

which would output:

torch.Size([1, 512])
torch.Size([1, 512])
torch.Size([3136, 1])

(let’s ignore labels for now) while you are setting batch_size=16 in your DataLoader, so something doesn’t seem right.
Could you add print statements to the DataLoader loop and print the shapes of both input tensors in each iteration and check if all batch sizes are equal?

Hi ptrblck! Thank you for taking the time to comment on my post, I really appreciate it! This is driving me crazy lol

So here’s the snippet of code (ignoring labels) with print statements added to the DataLoader loop.

batch_size = 16

data = TensorDataset(input_ids, attention_masks)
# Use the DataLoader class to create batches of the Dataset
dataloader = DataLoader(data, batch_size=batch_size, shuffle=True)

for batch in dataloader:
    inputs = batch[0]
    attention_masks = batch[1]

    print("Input shape: ", inputs.shape)
    print("Attention mask shape: ", attention_masks.shape)

This was the output:
Input shape: torch.Size([1, 512])
Attention mask shape: torch.Size([1, 512])

Assuming this code prints a single output, it would mean that only a single batch with a single sample is available.
What does print(len(dataloader)) and print(len(data)) return?

They both print out 1

This would mean that only a single sample is available in the entire dataset, which sounds wrong.
Also, I don’t understand how:

AssertionError: Size mismatch between tensors

could be raised as it seems at least both inputs have the same batch size.
I would guess tokenized_texts is also a list containing a single sample?

Hi ptrblck,
If that’s the case, would this suggest that only one sample is being processed at a time, rather than samples being batched together? It does sound wonky…

Yes, I also just double-checked on my end: there is only one sample in tokenized_texts.

Yes, only one sample is processed, but the main issue is that your dataset itself only contains a single sample, which means there are no more samples to load and process.

I guess this is also not expected?
If so, then I would recommend checking why only a single sample is loaded; I would guess either corpus_files also contains a single file or the if file_path.endswith("_Words.txt") condition fails.
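E.g. something like this quick check should narrow it down:

print(len(corpus_files))
print(corpus_files)
print(sum(f.endswith("_Words.txt") for f in corpus_files))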

Ahhh! So, I’ve decided to play around with the code and made some modifications so that all the text files in the Formal_Idioms_Corpus folder are read, instead of only the _Words.txt file.

corpus_path = "Formal_Idioms_Corpus/"

# create a list of file paths for all .txt files in the corpus
corpus_files = [os.path.join(corpus_path, f) for f in os.listdir(corpus_path) if f.endswith(".txt")]

This results in multiple samples in tokenized_texts.
New shapes:
Input shape: torch.Size([4, 512])
Attention mask shape: torch.Size([4, 512])

Update: I realized that I had missed a crucial part during preprocessing, and that’s what caused the error. I had to load the data from all three text files in the corpus instead of one, and split the sentences into individual words/tokens (why did I not realize this before?). For each sentence, I just needed to create a list of candidate idioms and their corresponding tags, then convert that list into a feature vector that can be fed into the model. I double-checked the sizes and everything matched!

torch.Size([81289, 64])
torch.Size([81289, 64])
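Roughly, the new encoding/padding step looks like this (a simplified sketch; sentences holds one string per instance read from the corpus files, and I’m padding to 64 tokens):

max_length = 64

input_ids = []
attention_masks = []
for sentence in sentences:
    encoded = tokenizer.encode_plus(
        sentence,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
    )
    input_ids.append(encoded['input_ids'])
    attention_masks.append(encoded['attention_mask'])

input_ids = torch.tensor(input_ids, dtype=torch.long)
attention_masks = torch.tensor(attention_masks, dtype=torch.long)
print(input_ids.shape, attention_masks.shape)  # both (num_instances, 64)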

Currently in the process of training the model (it’s taking a while though; so far I’m 36 minutes in, running it on a GPU).

Update 2: after training and evaluating the model on the validation set… the training accuracy is reported as 0.000, which doesn’t seem right (maybe the model is overfitting the training data, or I have to modify the architecture)…

Epoch 3: Train loss: 0.000 Train accuracy: 0.000 Validation accuracy: 1.000
Test accuracy: 1.000
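For reference, this is roughly how I’d sanity-check the accuracy computation from the logits next (a simplified sketch using the same variable names as above; dataloader here would be the train or validation loader):

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in dataloader:
        input_ids_batch = batch[0].to(device)
        attention_masks_batch = batch[1].to(device)
        labels_batch = batch[2].squeeze().to(device)
        logits = model(input_ids_batch, attention_masks_batch)[0]
        preds = logits.argmax(dim=-1)  # predicted class per sample
        correct += (preds == labels_batch).sum().item()
        total += labels_batch.size(0)
print("accuracy:", correct / total)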