Train on a fraction of the data set

I want to train with SGD on, say, 10% of CIFAR-10. However, I want that 10% to be a fixed subset of CIFAR-10 that does not change once training has started. How does one do this?

The only way I’ve thought of doing this is to loop through all of CIFAR-10, and once I’ve extracted the first 10% of the data, save it in a numpy file, then load it later with a different dataloader and use that for training. Would that work? What do people think?


You can use torch.utils.data.sampler.SubsetRandomSampler. Something like:

import torch, torchvision
from torch.utils.data import SubsetRandomSampler

transform = torchvision.transforms.ToTensor()  # or your own transform
cifar_dataset = torchvision.datasets.CIFAR10(root='./data', download=True, transform=transform)
subset_indices = list(range(5000))  # e.g. a fixed 10% of the 50000 training images
subset_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(subset_indices))  # note: sampler is mutually exclusive with shuffle=True
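Iterating the loader then only ever touches those 5000 images, reshuffled each epoch:

for images, labels in subset_loader:
    ...  # one training step on the fixed 10% subset

Each pass over subset_loader visits the chosen indices exactly once, in a fresh random order.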


Ah, I think that should work. But I find it confusing that the description says:

Samples elements randomly from a given list of indices, without replacement.

but at the same time it receives a list of indices. Do you know what the deal with that is?

btw, do you know if the answer you provided can also help with the other question I linked? That’s really the problem I am trying to solve.

I think I know how: just sort the indices according to the criterion I have, save those indices to a file, and recover them whenever I need them. Then use your data sampler!
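A rough sketch of that plan, where my_score is a hypothetical stand-in for the criterion:

import numpy as np

scores = np.array([my_score(image, label) for image, label in cifar_dataset])  # hypothetical per-example criterion
subset_indices = np.argsort(scores)[:5000]  # e.g. keep the 5000 lowest-scoring examples
np.save('subset_indices.npy', subset_indices)

# later runs: recover exactly the same subset
subset_indices = np.load('subset_indices.npy').tolist()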

SubsetRandomSampler restricts the DataLoader to the list of indices you pass. Otherwise everything is the same: the dataloader generates batches from these indices, without replacement and in random order. This is useful, for example, when you want to split your dataset into train/validation: you can split the indices and build two data loaders, each restricted to its own subset, as in the sketch below.
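For instance, a minimal 90/10 split in that style, reusing cifar_dataset from above (the ratio is just for illustration):

import torch
from torch.utils.data import SubsetRandomSampler

perm = torch.randperm(len(cifar_dataset)).tolist()  # shuffle all indices once, up front
split = int(0.9 * len(cifar_dataset))
train_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(perm[:split]))
val_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(perm[split:]))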

And I believe you can do what you want with SequentialSampler (assuming it is important not to shuffle). Generate the train/test indices using your rule and build two data loaders.

The thing is that during training I do want the batches to be produced randomly, but I don’t want my subset to be produced randomly; I want it to be created according to the criterion I developed. How does that change things? Should I use SequentialSampler or RandomSampler? This second sampler threw me off a bit.

Maybe something like this:

cifar_dataset = torchvision.datasets.CIFAR10(root='./data', transform=transform)
train_indices = ...  # select train indices according to your rule
test_indices = ...   # select test indices according to your rule
# no shuffle=True here: it is mutually exclusive with sampler, and SubsetRandomSampler already shuffles
train_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(train_indices))
test_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(test_indices))

If your rule can be computed fast, you don’t even have to bother saving it to a file. Btw, CIFAR10 has a standard train/test division, and the snippet above splits the train set further. Do you want to ignore the original split, or is using just the train set fine?
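For reference, the built-in split is selected with the train flag of the dataset constructor:

train_set = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform)   # the 50000 training images
test_set = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)   # the 10000 test images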


Oh no, I just realized my criterion is sort of expensive to compute: it requires changing the labels according to my criterion… I think I’m going to have to create and save a new dataset, so that I change the labels according to this criterion only once…
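A rough sketch of that caching idea, assuming a torchvision version whose CIFAR10 stores its labels in a .targets attribute (my_expensive_relabeling is a hypothetical stand-in for the criterion):

import torch

# expensive pass, run once: compute the new labels and cache them to disk
new_targets = my_expensive_relabeling(cifar_dataset)  # hypothetical; returns one label per example
torch.save(new_targets, 'relabeled_targets.pt')

# later runs: skip the expensive pass and just reuse the cached labels
cifar_dataset.targets = torch.load('relabeled_targets.pt')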

Actually Artur, how do you get the indices in the first place?

Hi Artur -

I tried your solution with another dataset (my own data). However, the sampler did not cut the data into two parts according to the indices. The total number of elements I had for training and validation was 300 altogether; the train indices included 260 and the test indices 40. After constructing the dataloaders I check len(test_loader.dataset.tensors[0]) and I get 300…
I must be doing something wrong.
One thing I noticed: shuffle=True cannot be combined with a sampler; it raises: ValueError: sampler option is mutually exclusive with shuffle

here are the two lines:
train_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, sampler=SubsetRandomSampler(train_idx))
test_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, sampler=SubsetRandomSampler(test_idx))

can you see a reason why this should not work?
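For what it’s worth, the sampler only restricts which indices the loader draws; it does not shrink the underlying dataset, so inspecting test_loader.dataset will always report all 300 samples. A minimal check of what the loader actually yields:

print(len(test_loader.dataset))   # 300: the full dataset, untouched by the sampler
print(len(test_loader.sampler))   # 40: the number of indices the loader draws from
print(sum(xb.shape[0] for xb, yb in test_loader))  # 40, counted batch by batch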