Index concept in torch.utils.data.DataLoader

Hi,

I'm confused about the image index concept when using torch.utils.data.DataLoader. Is this index the index of an image in the entire training/testing dataset, or just its index within the mini-batch? If it's per mini-batch, does that mean that in the next mini-batch the image with index 1 is the same as the image with index 1 in the previous one?

The DataLoader uses a sampler (e.g. a RandomSampler if you specify shuffle=True) to create indices in the range [0, len(dataset) - 1]. These indices are used to index the passed Dataset instance, which calls into its __getitem__(self, index) method.
So yes, the index refers to a position in the entire dataset: if the DataLoader uses the same index in the next epoch, Dataset.__getitem__ will see the same index and return the same sample.
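For illustration, here is a minimal, self-contained sketch (the dataset name and sizes are just made up for this example) showing that the indices passed to __getitem__ always refer to positions in the whole dataset:

from torch.utils.data import Dataset, DataLoader

class IndexEchoDataset(Dataset):
    # toy dataset that simply returns the index it was asked for
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # index is a position in the whole dataset: 0 ... len(self) - 1
        return index

loader = DataLoader(IndexEchoDataset(10), batch_size=4, shuffle=True)
for batch in loader:
    print(batch)  # e.g. tensor([7, 2, 9, 0]) -- global dataset positions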


Thank you for your reply, @ptrblck. So does this mean that, when using torch.utils.data.DataLoader for trainloader, the batch_idx values in

for batch_idx, (images, targets) in enumerate(trainloader):

are indices into the entire dataset?

No. The custom or internal sampler will create the indices and pass them to the Dataset.
batch_idx is created by your enumerate call and will assign an increasing index to each batch, which usually contains more than a single sample.
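To make the difference concrete, a small sketch (the tensor shapes and sizes are arbitrary): with 10 samples and batch_size=4, batch_idx only takes the values 0, 1, 2, regardless of which samples the sampler put into each batch.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10, 3), torch.randint(0, 2, (10,)))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# batch_idx counts batches (0, 1, 2), not dataset samples
for batch_idx, (images, targets) in enumerate(loader):
    print(batch_idx, images.shape)  # the last batch holds the remaining 2 samples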

Then how can I get the index of each image?

You could write a custom Dataset and return the index with the data and target.

def __getitem__(self, index):
    # your data loading logic
    x = self.data[index]
    y = self.target[index]
   
    # transformations
    ...

    return x, y, index

This would then yield an additional tensor containing the used indices for each batch in your training loop.
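Putting it together, a usage sketch (the class name IndexedDataset and the shapes are placeholders, not a fixed API):

import torch
from torch.utils.data import Dataset, DataLoader

class IndexedDataset(Dataset):
    def __init__(self, data, target):
        self.data = data
        self.target = target

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # return the sample index together with data and target
        return self.data[index], self.target[index], index

dataset = IndexedDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_idx, (images, targets, indices) in enumerate(loader):
    # indices contains the dataset positions of the samples in this batch
    print(batch_idx, indices)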
