Create a dataset subset with independent indexing

kpatil · August 18, 2020, 11:20am

I want to create a subset of a dataset which has its own indexing, i.e. following range(len(subset)).

When I create a subset with the provided utility and iterate over it, the indices returned are those of the original dataset.

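import numpy as np
import torch
from torch.utils.data import DataLoader

# dataset is assumed to return (inputs, labels, index) from __getitem__
n = len(dataset)
indices = np.random.choice(n, int(n * 0.5))  # samples with replacement by default
subset = torch.utils.data.Subset(dataset, indices)

loader = DataLoader(subset, batch_size=100, shuffle=True, num_workers=0)

for batchid, (inputs, labels, index) in enumerate(loader):
    assert np.max(np.array(index)) < len(subset)  # this assertion fails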

ptrblck · August 20, 2020, 9:28am

The indices you created are random values in the range [0, len(dataset)) and are not clipped to the length of the subset.
I assume that the returned index value is the index passed to the original dataset's __getitem__, so this behavior is expected.

If you want indices in the range [0, len(subset)), you could pass them directly to Subset using torch.arange(int(n*0.5)). Note that this selects the first half of the dataset rather than a random sample.
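
To keep a random selection and still get indices that follow range(len(subset)), one option is a small wrapper that discards the index coming from the original dataset and returns the subset-local position instead. The following is a minimal sketch, not part of torch.utils.data: IndexedDataset is a hypothetical stand-in for the original dataset (assumed to return (inputs, labels, index)), and LocalIndexSubset is an illustrative name.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, Subset


class IndexedDataset(Dataset):
    """Hypothetical stand-in for the original dataset: returns (input, label, index)."""

    def __init__(self, n):
        self.x = torch.randn(n, 3)
        self.y = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i], i


class LocalIndexSubset(Dataset):
    """Illustrative wrapper: like Subset, but returns the position within the subset."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = [int(i) for i in indices]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Look up the sample by its original index, then drop that index
        # and report the subset-local position i instead.
        inputs, labels, _ = self.dataset[self.indices[i]]
        return inputs, labels, i


dataset = IndexedDataset(1000)
n = len(dataset)
indices = np.random.choice(n, n // 2, replace=False)  # sample without duplicates

# Plain Subset forwards indices[i] to the wrapped dataset, so the original index comes back.
plain_subset = Subset(dataset, indices.tolist())
_, _, original_idx = plain_subset[0]

subset = LocalIndexSubset(dataset, indices)
loader = DataLoader(subset, batch_size=100, shuffle=True, num_workers=0)

seen = []
for inputs, labels, index in loader:
    assert int(index.max()) < len(subset)  # holds for every batch
    seen.extend(index.tolist())
```

If the original index is still needed, subset.indices[i] maps a subset-local position back to it.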