What is len(dataloader) equal to?

Hello all.
I recently noticed that len(dataloader) is not the same as len(dataloader.dataset).
Based on the Udacity PyTorch course, I tried to calculate accuracy with the following lines of code:

accuracy = 0
for imgs, labels in dataloader_test:
    preds = model(imgs)
    values, indexes = preds.topk(k=1, dim=1)
    result = (indexes == labels.view(*indexes.shape)).float()  # reshape labels so the comparison is elementwise
    accuracy += torch.mean(result)
print(f'acc_val = {accuracy / len(dataloader_test)}')

For the record, Udacity wrote this:

        test_loss = 0
        accuracy = 0
        
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for images, labels in testloader:
                log_ps = model(images)
                test_loss += criterion(log_ps, labels)
                
                ps = torch.exp(log_ps)
                top_p, top_class = ps.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                accuracy += torch.mean(equals.type(torch.FloatTensor))
                
        train_losses.append(running_loss/len(trainloader))
        test_losses.append(test_loss/len(testloader))

        print("Epoch: {}/{}.. ".format(e+1, epochs),
              "Training Loss: {:.3f}.. ".format(running_loss/len(trainloader)),
              "Test Loss: {:.3f}.. ".format(test_loss/len(testloader)),
              "Test Accuracy: {:.3f}".format(accuracy/len(testloader)))

As you can see below, the validation accuracy is reported like this:
"Test Accuracy: {:.3f}".format(accuracy/len(testloader)))

So len(testloader) must match the whole test set. Also, two lines above it, in:
test_losses.append(test_loss/len(testloader))
it's dividing the loss by len(testloader), so it should be equal to the whole test set size, otherwise it doesn't make sense!

In my case, it prints 313 for len(dataloader_test).
My dataloader_test is defined as follows:

import torch.utils.data as data
from torchvision import datasets

# 'transformations' is defined earlier in the script
dataset_train = datasets.MNIST(root='MNIST', train=True, transform=transformations, download=True)
dataset_test = datasets.MNIST(root='MNIST', train=False, transform=transformations, download=True)

dataloader_train = data.DataLoader(dataset_train, batch_size=32, shuffle=True, num_workers=2)
dataloader_test = data.DataLoader(dataset_test, batch_size=32, shuffle=False, num_workers=2)

print(f'test dataloader size: {len(dataloader_test)}')

So what am I missing here? Why am I getting 313 for len(dataloader_test) when I should be getting 10K for the MNIST test set?


It equals the number of batches.
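
For example, here is a minimal sketch with a toy dataset of 10,000 samples (a random stand-in for the MNIST test set, not the real data): with batch_size=32 and the default drop_last=False, len(dataloader) is ceil(10000 / 32) = 313, exactly the number reported above.

import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the 10,000-sample MNIST test set (random data, for illustration only).
dataset = TensorDataset(torch.randn(10000, 1, 28, 28), torch.randint(0, 10, (10000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=False)

print(len(dataset))                    # 10000 samples
print(len(loader))                     # 313 batches
print(math.ceil(len(dataset) / 32))    # 313, same as len(loader)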


@SimonW For the sake of a sanity check, is:

len(dataloader) = dataset size / batch size

?
I am getting these values when I print them, which are confusing me:

len(dataloaders['train'].dataset)=236436
len(dataloaders['train'])=59109
len(dataloaders['train'])/opts.batch_size=14777.25

Related: How can I know the size of data_loader when I use: torchvision.datasets.ImageFolder

OK, here is the code of DataLoader.__len__ from PyTorch:

    def __len__(self) -> int:
        if self._dataset_kind == _DatasetKind.Iterable:
            # NOTE [ IterableDataset and __len__ ]
            #
            # For `IterableDataset`, `__len__` could be inaccurate when one naively
            # does multi-processing data loading, since the samples will be duplicated.
            # However, no real use case should be actually using that behavior, so
            # it should count as a user error. We should generally trust user
            # code to do the proper thing (e.g., configure each replica differently
            # in `__iter__`), and give us the correct `__len__` if they choose to
            # implement it (this will still throw if the dataset does not implement
            # a `__len__`).
            #
            # To provide a further warning, we track if `__len__` was called on the
            # `DataLoader`, save the returned value in `self._len_called`, and warn
            # if the iterator ends up yielding more than this number of samples.

            # Cannot statically verify that dataset is Sized
            length = self._IterableDataset_len_called = len(self.dataset)  # type: ignore
            if self.batch_size is not None:  # IterableDataset doesn't allow custom sampler or batch_sampler
                from math import ceil
                if self.drop_last:
                    length = length // self.batch_size
                else:
                    length = ceil(length / self.batch_size)
            return length
        else:
            return len(self._index_sampler)
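
For a map-style dataset (like datasets.MNIST above), the else branch is taken: len(dataloader) is len(self._index_sampler), which with the default batch sampler is ceil(len(dataset) / batch_size), or floor(len(dataset) / batch_size) when drop_last=True. A quick sketch with a toy dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100.).unsqueeze(1))  # 100 toy samples

print(len(DataLoader(dataset, batch_size=32)))                  # 4 -> ceil(100 / 32)
print(len(DataLoader(dataset, batch_size=32, drop_last=True)))  # 3 -> floor(100 / 32)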

Your calculation is a bit wrong, since you are dividing the number of batches by the batch size:

len(dataloaders['train'].dataset)=236436
len(dataloaders['train'])=59109
len(dataloaders['train'])/opts.batch_size=14777.25 # this is wrong
len(dataloaders['train'].dataset) / opts.batch_size = 59109 # 236436 / 4 = 59109 
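
To double-check the relationship on any loader, you can assert it directly. A minimal sketch with random toy data of the same size (236,436 samples, batch size 4), assuming the default drop_last=False:

import math
import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(TensorDataset(torch.randn(236436, 3)), batch_size=4, shuffle=True)

# Number of batches = ceil(dataset size / batch size) when drop_last=False.
assert len(loader) == math.ceil(len(loader.dataset) / loader.batch_size)
print(len(loader))  # 59109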

I had just figured it out… darn it. That was a subtle typo… Thanks! 🙂


len(dataloader) gives the number of batches, not the batch size. If you want the total number of data points (e.g., to divide a count of correct predictions when computing accuracy), use len(dataloader.dataset) instead.
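
As a sketch of that approach, reusing model and dataloader_test from the original post: count the correct predictions and divide by the number of samples rather than the number of batches.

import torch

correct = 0
with torch.no_grad():
    for imgs, labels in dataloader_test:
        preds = model(imgs)
        correct += (preds.argmax(dim=1) == labels).sum().item()

accuracy = correct / len(dataloader_test.dataset)  # divide by the number of samples, not batches
print(f'acc_val = {accuracy}')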