DataLoader returns the wrong batch shape

I’m having trouble fine-tuning a pretrained monolingual RoBERTa model.

This is my custom dataset:

import torch
from torch.utils.data import Dataset, DataLoader

class VnParaDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, index):
        item = {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
        return item

'input_ids', 'token_type_ids', and 'attention_mask' in encodings each have shape [5440, 193].

My Dataloader is:

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

However, when I loop through the batches, each batch has shape [3, 193] and len(train_loader) is 1:

for epoch in range(epochs):
    count = 0
    for batch in train_loader:
        print('Epoch {0} - Batch {1}'.format(epoch, count))
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = phobert(input_ids, attention_mask=attention_mask, token_type_ids=None)
        loss = outputs[0]
        count += 1

Shouldn’t each batch have shape [32, 193], with 5440/32 = 170 batches in train_loader?

The length of a dict is the number of its keys, so your __len__ returns 3 (one per key: 'input_ids', 'token_type_ids', 'attention_mask'). The DataLoader therefore sees a dataset of only 3 samples, which fits in a single batch of shape [3, 193]. Return the number of samples instead, e.g. the length of one of the values:

    def __len__(self):
        return len(self.encodings['input_ids'])
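To illustrate, here is a minimal self-contained sketch with dummy encodings of the shape described in the question (5440 samples, sequence length 193; the dummy zero/one tensors stand in for real tokenizer output):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class VnParaDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        # Length of one value = number of samples.
        # len(self.encodings) would instead return 3, the number of keys.
        return len(self.encodings['input_ids'])

    def __getitem__(self, index):
        return {key: torch.tensor(val[index]) for key, val in self.encodings.items()}

# Dummy stand-in for tokenizer output, matching the shapes in the question.
encodings = {
    'input_ids': torch.zeros(5440, 193, dtype=torch.long),
    'token_type_ids': torch.zeros(5440, 193, dtype=torch.long),
    'attention_mask': torch.ones(5440, 193, dtype=torch.long),
}

dataset = VnParaDataset(encodings)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(len(dataset))              # 5440
print(len(loader))               # 170, i.e. 5440 / 32
batch = next(iter(loader))
print(batch['input_ids'].shape)  # torch.Size([32, 193])
```

With the buggy `__len__` (returning `len(self.encodings)`), the same loop prints a dataset length of 3, one batch, and a batch shape of [3, 193], exactly as observed.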