DataLoader: dataset length doesn't fit the last batch

Hey, I’m new to ML. I have a small dataset and want to make the most of my RTX 3080, so I’m trying to use a large batch size of 64. With a batch size of 8, training took 97 minutes; with a batch size of 64, it only took 65 minutes. However, I’m concerned about potential data loss.

The following code prints the number of images that would be needed if every batch were completely full, alongside the actual number of available images:

# images needed if every batch of 64 were full vs. images actually available
print(len(train_loader) * 64, train_data_len)
print(len(validation_loader) * 64, val_data_len)
print(len(test_loader) * 64, test_data_len)

Output:

7616 7563
896 841
960 924

With a batch size of 64, the train loader has 119 iterations (ceil(7563 / 64)), so filling every batch would require 119 * 64 = 7616 images. However, I only have 7563 images available for training, which means the last batch is 7616 - 7563 = 53 images short (it contains only 11 images). What happens to that incomplete last batch?

Will it be ignored or filled with random data?

The last batch isn’t dropped unless you set drop_last=True when instantiating the DataLoader; by default it is simply returned as a smaller batch (11 samples in your case).
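
To make this concrete, here is a minimal sketch using a randomly generated stand-in dataset with 7563 samples (since I don’t have your actual data), showing both behaviours:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset with 7563 samples, mirroring the numbers above.
dataset = TensorDataset(torch.randn(7563, 3, 32, 32), torch.randint(0, 10, (7563,)))

# Default behaviour (drop_last=False): 119 batches, the last one holds 11 samples.
loader = DataLoader(dataset, batch_size=64)
batch_sizes = [x.shape[0] for x, _ in loader]
print(len(loader), batch_sizes[-1])        # 119 11

# drop_last=True: the 11 leftover samples are skipped, leaving 118 full batches.
loader_drop = DataLoader(dataset, batch_size=64, drop_last=True)
print(len(loader_drop))                    # 118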


If I may ask one more small question: is there a drawback to using a relatively large batch size to speed up training on such a small dataset, given that no data is lost as I initially feared?

There shouldn’t be.
Smaller batch sizes can act as a form of regularization and help prevent overfitting, but that is just one technique you could try if you aren’t able to regularize using other methods.
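
In case it helps, here is a minimal sketch of the kind of "other methods" I mean, e.g. dropout and weight decay; the model, dropout probability, and weight_decay value are just placeholders you would tune for your task:

import torch.nn as nn
import torch.optim as optim

# Hypothetical small model; the Dropout layer and the optimizer's weight_decay
# are two common regularization techniques you could try instead of (or in
# addition to) shrinking the batch size.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 10),
)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)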
