About the relation between batch_size and length of data_loader

Hello. I want to ask about the relation between batch_size and the length of the data_loader.
Assuming I have a dataset with 1000 images and set it as the train loader, during training there will be Python code like:

for i, (input, target) in enumerate(train_loader):

I'm not sure about this: if I set the batch size to 10, will the train_loader's length change to 100? And will its length become 50 if I set the batch size to 20? Or will the train_loader keep its length of 1000 no matter how I set the batch size?
Thanks!

I am not sure about your issue.
My understanding is: if the data_loader has 100 samples (constant) and I set batch_size = 20, then one iteration fetches 20 samples from the data_loader, and training on all the data takes 100/20 = 5 iterations.

But the number of backprop steps would be different…

The length of the loader will adapt to the batch_size. So if your train dataset has 1000 samples and you use a batch_size of 10, the loader will have length 100.
Note that the last batch yielded by your loader can be smaller than the actual batch_size if the dataset size is not evenly divisible by the batch_size. E.g. for 1001 samples and a batch_size of 10, len(train_loader) will be 101 and the last batch will only contain 1 sample. You can avoid this by setting drop_last=True.

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, size):
        self.x = torch.randn(size, 1)

    def __getitem__(self, index):
        return self.x[index]

    def __len__(self):
        return len(self.x)

dataset = MyDataset(1001)

data_loader = DataLoader(dataset,
                         batch_size=10)

print(len(data_loader))

for batch_idx, data in enumerate(data_loader):
    print('batch idx {}, batch len {}'.format(
        batch_idx, len(data)))

# drop_last=True discards the final, smaller batch
data_loader = DataLoader(dataset,
                         batch_size=10,
                         drop_last=True)

print(len(data_loader))

for batch_idx, data in enumerate(data_loader):
    print('batch idx {}, batch len {}'.format(
        batch_idx, len(data)))
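
With the snippet above, the first loader should report len(data_loader) == 101 and its last printed batch should have length 1, while the drop_last=True loader should report 100 batches of length 10 each.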

Thanks for your reply. This is a bit different from caffe, which doesn't have the concept of a dataloader…

Dataset provides an interface to access a single sample in the dataset using its index. DataLoader is used to provide batches of samples for training your models with SGD or similar variants.

And it's beautiful, isn't it? :slight_smile:
DataLoader also provides several arguments such as num_workers to use multiprocessing for the data loading and preprocessing, shuffle, etc.
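As a minimal sketch of those arguments (the values below are arbitrary assumptions, and dataset is the MyDataset instance defined in the snippet above):

from torch.utils.data import DataLoader

# hypothetical argument values, just to illustrate the options mentioned above
loader = DataLoader(dataset,           # the MyDataset instance defined earlier
                    batch_size=32,     # samples per batch
                    shuffle=True,      # reshuffle the data every epoch
                    num_workers=4)     # subprocesses used for loading/preprocessing

for batch in loader:
    pass  # each `batch` here is a tensor of shape [batch_size, 1] (the last one may be smaller)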
This might be a good starter for further questions.

Thanks. :yum::yum: I think it’s more concrete than caffe.

However, there will be some differences when reproducing something from caffe in pytorch. Since caffe has no concept of an epoch, it updates the lr every iteration, while pytorch usually updates the lr every epoch, so there will be some deviation…

What do you mean? The learning rate is usually not updated automatically.
You can adjust the learning rate with e.g.:

def adjust_learning_rate(optimizer, epoch):
    """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
    lr = args.lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

or with optim.lr_scheduler.
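
For example, a minimal sketch with optim.lr_scheduler (the model and optimizer below are placeholders, not from the example above) that reproduces the same decay-by-10-every-30-epochs schedule:

import torch
from torch import nn, optim

model = nn.Linear(10, 2)                            # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)   # placeholder optimizer

# decay the learning rate by a factor of 0.1 every 30 epochs,
# matching the adjust_learning_rate function above
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... train for one epoch here ...
    optimizer.step()    # dummy optimizer step for this sketch
    scheduler.step()    # update the learning rate once per epoch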

Yes, you are right. What I mean is that in the pytorch official examples there is usually some code like this:

for epoch in range(args.start_epoch, args.epochs):
        if args.distributed:
            train_sampler.set_epoch(epoch)
        adjust_learning_rate(optimizer, epoch)

        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch)

        # evaluate on validation set
        prec1 = validate(val_loader, model, criterion)

from imagenet.py.
From the code above, pytorch usually updates the lr outside of the loop over train_loader, while caffe updates it every iteration. I could also change adjust_learning_rate(optimizer, epoch) to adjust_learning_rate(optimizer, iter) and move it into the loop over train_loader, but in many example codes I found it written in the former style.

You can set up a counter to count how many iterations have passed, which is rather simple. Then you can adjust the learning rate based on the iteration, the same way as caffe.
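
A rough sketch of that idea, reusing the names from the example above and a hypothetical adjust_learning_rate_iter helper (the base LR and step size are made-up values):

def adjust_learning_rate_iter(optimizer, iteration, base_lr=0.1, step=30000):
    """Hypothetical helper: decays the LR by 10 every `step` iterations (caffe-style)."""
    lr = base_lr * (0.1 ** (iteration // step))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

iteration = 0
for epoch in range(args.start_epoch, args.epochs):
    for i, (input, target) in enumerate(train_loader):
        adjust_learning_rate_iter(optimizer, iteration)
        # ... forward pass, loss, backward, optimizer.step() ...
        iteration += 1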

I think adjusting the learning rate based on the epoch may be enough to achieve good results.

Thanks.
What is the meaning of len(data_loader)? What does its length indicate?

It returns the number of batches of data generated by the DataLoader.
For instance: if the total number of samples in your dataset is 320 and you've selected a batch_size of 32, len(data_loader) will be 10; if batch_size is 16, len(data_loader) is 20.

To keep it simple: len(data_loader) = ceil(number of samples in the dataset / batch_size) (with the default drop_last=False).
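
A quick sketch to check that formula, reusing the MyDataset class from earlier in the thread (the sizes below are arbitrary):

import math
from torch.utils.data import DataLoader

for size, batch_size in [(320, 32), (320, 16), (1001, 10)]:
    loader = DataLoader(MyDataset(size), batch_size=batch_size)
    # ceil covers the possibly smaller last batch when drop_last=False
    assert len(loader) == math.ceil(size / batch_size)
    print(size, batch_size, len(loader))   # e.g. 320 32 10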

Thanks.
I have different batch sizes for the training, testing and validation datasets.
How can I make them all use the same batch size?
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)

valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)

test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)

I did that, but the batch sizes still differ.

Generally, you can declare the batch_size variable once before calling DataLoader for the train, val and test loaders. This should work!

In simple words, train_loader will provide batches of images (of size batch_size). So the number of iterations per epoch would be len(train_loader.dataset)/batch_size.

It is len(data_loader.dataset)/batch_size.

And len(data_loader.dataset) is the total number of images in the dataset (which is passed to the data_loader).

I think it should work. It might be a variable name issue. Try passing a constant number:
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, num_workers=num_workers, shuffle=True)

It seems like every time I want to know something niche about PyTorch I always see your face. Just wanna say thanks for the help. :slight_smile:
