About the relation between batch_size and length of data_loader

Hello. I want to ask about the relation between batch_size and the length of the data_loader.
Assuming I have a dataset with 1000 images and set it as the train loader, during training there will be Python code like:

for i, (input, target) in enumerate(train_loader):

I'm not sure about this: if I set the batch size to 10, will the train_loader's length change to 100? And will its length become 50 if I set the batch size to 20? Or will the train_loader keep its length of 1000 no matter how I set the batch size?
Thanks!

I am not sure about your issue.
My understanding is: if the data_loader has 100 samples (constant) and I set batch_size = 20, then one iteration fetches 20 samples from the data_loader, and training on all the data takes 100/20 = 5 iterations.

But the number of backprop steps would be different…

The length of the loader will adapt to the batch_size. So if your train dataset has 1000 samples and you use a batch_size of 10, the loader will have length 100.
Note that the last batch yielded by your loader can be smaller than the actual batch_size if the dataset size is not evenly divisible by the batch_size. E.g. for 1001 samples and a batch_size of 10, len(train_loader) will be 101 and the last batch will only contain 1 sample. You can avoid this by setting drop_last=True.

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, size):
        self.x = torch.randn(size, 1)

    def __getitem__(self, index):
        return self.x[index]

    def __len__(self):
        return len(self.x)

dataset = MyDataset(1001)

data_loader = DataLoader(dataset,
                         batch_size=10)

print(len(data_loader))

for batch_idx, data in enumerate(data_loader):
    print('batch idx {}, batch len {}'.format(
        batch_idx, len(data)))

# drop_last=True discards the final, smaller batch
data_loader = DataLoader(dataset,
                         batch_size=10,
                         drop_last=True)

print(len(data_loader))

for batch_idx, data in enumerate(data_loader):
    print('batch idx {}, batch len {}'.format(
        batch_idx, len(data)))
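
With the snippet above, the first loader should report len(data_loader) == 101 and its last printed batch should have length 1, while the drop_last=True loader should report 100 batches of length 10 each.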

Thanks for your reply. This is a bit different from caffe, which doesn't have the concept of a dataloader…

Dataset provides an interface to access a single sample in the dataset using its index. DataLoader is used to provide batches of samples for training your models with SGD or similar variants.

And it's beautiful, isn't it? :slight_smile:
DataLoader also provides several arguments such as num_workers to use multiprocessing for the data loading and preprocessing, shuffle, etc.
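As a minimal sketch of those arguments (the values below are arbitrary assumptions, and dataset is the MyDataset instance defined in the snippet above):

from torch.utils.data import DataLoader

# hypothetical argument values, just to illustrate the options mentioned above
loader = DataLoader(dataset,           # the MyDataset instance defined earlier
                    batch_size=32,     # samples per batch
                    shuffle=True,      # reshuffle the data every epoch
                    num_workers=4)     # subprocesses used for loading/preprocessing

for batch in loader:
    pass  # each `batch` here is a tensor of shape [batch_size, 1] (the last one may be smaller)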
This might be a good starter for further questions.

Thanks. :yum::yum: I think it’s more concrete than caffe.

However, there will be some differences when reproducing something from caffe in pytorch. Since caffe has no concept of an epoch, it updates the lr every iteration, while pytorch usually updates the lr every epoch, so there will be some deviation…

What do you mean? The learning rate is usually not updated automatically.
You can adjust the learning rate with e.g.:

def adjust_learning_rate(optimizer, epoch):
    """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
    lr = args.lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

or with optim.lr_scheduler.
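
For example, a minimal sketch with optim.lr_scheduler (the model and optimizer below are placeholders, not from the example above) that reproduces the same decay-by-10-every-30-epochs schedule:

import torch
from torch import nn, optim

model = nn.Linear(10, 2)                            # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)   # placeholder optimizer

# decay the learning rate by a factor of 0.1 every 30 epochs,
# matching the adjust_learning_rate function above
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... train for one epoch here ...
    optimizer.step()    # dummy optimizer step for this sketch
    scheduler.step()    # update the learning rate once per epoch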

Yes, you are right. What I mean is that in the pytorch official examples there is usually some code like this:

for epoch in range(args.start_epoch, args.epochs):
        if args.distributed:
            train_sampler.set_epoch(epoch)
        adjust_learning_rate(optimizer, epoch)

        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch)

        # evaluate on validation set
        prec1 = validate(val_loader, model, criterion)

from imagenet.py.
From the code above, pytorch usually updates the lr outside of the loop over train_loader, while caffe updates it every iteration. I could also change adjust_learning_rate(optimizer, epoch) to adjust_learning_rate(optimizer, iter) and move it into the loop over train_loader, but in many example codes I found it written in the former style.

You can set up a counter to count how many iterations have passed, which is rather simple. Then you can adjust the learning rate based on the iteration, the same way as caffe.
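
A rough sketch of that idea, reusing the names from the example above and a hypothetical adjust_learning_rate_iter helper (the base LR and step size are made-up values):

def adjust_learning_rate_iter(optimizer, iteration, base_lr=0.1, step=30000):
    """Hypothetical helper: decays the LR by 10 every `step` iterations (caffe-style)."""
    lr = base_lr * (0.1 ** (iteration // step))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

iteration = 0
for epoch in range(args.start_epoch, args.epochs):
    for i, (input, target) in enumerate(train_loader):
        adjust_learning_rate_iter(optimizer, iteration)
        # ... forward pass, loss, backward, optimizer.step() ...
        iteration += 1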

I think adjusting the learning rate based on the epoch may be enough to achieve good results.

Thanks.
What is the meaning of len(data_loader)? What does its length indicate?

It returns the number of batches of data generated by the DataLoader.
For instance: if the total number of samples in your dataset is 320 and you've selected a batch_size of 32, len(data_loader) will be 10; if batch_size is 16, len(data_loader) is 20.

To keep it simple: len(data_loader) = ceil(number of samples in the dataset / batch_size) (with the default drop_last=False).
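
A quick sketch to check that formula, reusing the MyDataset class from earlier in the thread (the sizes below are arbitrary):

import math
from torch.utils.data import DataLoader

for size, batch_size in [(320, 32), (320, 16), (1001, 10)]:
    loader = DataLoader(MyDataset(size), batch_size=batch_size)
    # ceil covers the possibly smaller last batch when drop_last=False
    assert len(loader) == math.ceil(size / batch_size)
    print(size, batch_size, len(loader))   # e.g. 320 32 10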

Thanks.
I have different batch sizes for the training, testing and validation datasets.
How can I make them all use the same batch size?
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)

valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)

test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)

I did that, but the batch sizes still differ.

Generally, you can declare the batch_size variable once before calling DataLoader for the train, val and test loaders. This should work!

In simple words, train_loader will provide batches of images (of size batch_size). So the number of iterations per epoch would be len(train_loader.dataset)/batch_size.

It is len(data_loader.dataset)/batch_size.

And len(data_loader.dataset) is the total number of images in the dataset (which is passed to the data_loader).

I think it should work. It might be a variable name issue. Try passing a constant number:
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, num_workers=num_workers, shuffle=True)

It seems like every time I want to know something niche about PyTorch I always see your face. Just wanna say thanks for the help. :slight_smile:
