DataLoader batches that also give you the original dataset indices of that batch

Not sure how to formulate the question to search whether it has already been asked.
I have a shuffled DataLoader, and for each batch of size batch_size I want an index tensor of length batch_size containing the indices the samples had in the original dataset.
So if, for example, batch_size=2 and the DataLoader randomly chose the 100th and 420th samples of the dataset, the batch should instead be a 2-tuple (index_or_tensor(100, 420), tensor(sample_100, sample_420)).

Thanks for your answer

You can create a custom Dataset and return the indices together with the samples in its __getitem__ method.
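A minimal sketch of such a custom Dataset (the class name IndexedDataset is just illustrative):

```python
import torch
from torch.utils.data import Dataset

class IndexedDataset(Dataset):
    # wraps a tensor and returns (index, sample) pairs
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # return the original dataset index alongside the sample
        return idx, self.data[idx]

dataset = IndexedDataset(torch.arange(100, 1000))
idx, sample = dataset[420]  # idx == 420, sample == tensor(520)
```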

Alternatively, TensorDataset should work without requiring a custom class.
It accepts tensors as inputs and returns a tuple:

import torch
from torch.utils.data import TensorDataset

dummy_data = torch.arange(100, 1000) # replace with your data

dataset = TensorDataset(torch.arange(len(dummy_data)), dummy_data)

dataset[[100, 420]] # gives (tensor([100, 420]), tensor([200, 520]))
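Wrapped in a shuffled DataLoader, each batch then yields the original indices as its first element; a sketch of the full loop:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dummy_data = torch.arange(100, 1000)  # replace with your data
# pair each sample with its position in the original dataset
dataset = TensorDataset(torch.arange(len(dummy_data)), dummy_data)

loader = DataLoader(dataset, batch_size=2, shuffle=True)
for indices, samples in loader:
    # indices holds the samples' positions in the original dataset,
    # so indexing the source tensor recovers the same samples
    assert torch.equal(samples, dummy_data[indices])
```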