Train on a fraction of the data set


(MirandaAgent) #1

I want to train with SGD on, say, 10% of CIFAR-10. However, I want that 10% to be a fixed subset of CIFAR-10 that does not change once training has started. How does one do this?

The only way I’ve thought of doing this is to loop through all of CIFAR-10, extract the first 10% of the data, save it to a numpy file, and then load that with a different dataloader for training later. Would that work? What do people think?
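The save-to-file approach can work. A minimal sketch of the idea, using a small synthetic array as a stand-in for CIFAR-10 (the filename and shapes are illustrative):

```python
import numpy as np

# Synthetic stand-in for CIFAR-10: 100 "images" of shape 3x32x32 with labels.
images = np.random.rand(100, 3, 32, 32).astype(np.float32)
labels = np.random.randint(0, 10, size=100)

# Take a fixed 10% slice once and save it for later runs.
n_subset = len(images) // 10
np.savez("cifar_subset.npz", images=images[:n_subset], labels=labels[:n_subset])

# Later: reload the exact same subset for training.
subset = np.load("cifar_subset.npz")
```

Because the slice is saved once, every later run trains on the identical subset.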


How does one create a dataset in PyTorch and save it to a file to be used later?
(Artur Lacerda) #2

You can use torch.utils.data.sampler.SubsetRandomSampler. Something like:

import torch
import torchvision
from torch.utils.data.sampler import SubsetRandomSampler

cifar_dataset = torchvision.datasets.CIFAR10(root='./data', transform=transform)
subset_indices = # select your indices here
subset_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(subset_indices), ...)

(Note: shuffle is left off because DataLoader raises an error when both shuffle and a sampler are given; the sampler already randomizes the order.)


(MirandaAgent) #3

Ah, I think that should work. But I find it confusing that the description says:

Samples elements randomly from a given list of indices, without replacement.

but at the same time it receives a list of indices. Do you know what the deal is with that?
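For what it's worth, "without replacement" here just means that within one pass the sampler yields each of the given indices exactly once, in a freshly shuffled order. A quick check:

```python
from torch.utils.data.sampler import SubsetRandomSampler

indices = [3, 7, 11, 42]
sampler = SubsetRandomSampler(indices)

# One pass over the sampler yields each given index exactly once,
# in random order (a permutation of the list, not random draws with repeats).
drawn = list(sampler)
```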


(MirandaAgent) #4

btw, do you know if the answer you provided can aid in solving:

that’s really the problem I am trying to solve.


(MirandaAgent) #5

I think I know how. Just sort the indices according to the criterion I have, save those indices, and recover them from a file whenever I need them. Then use your data sampler!
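That plan can be sketched as follows; the scoring rule and filename here are placeholders for the real criterion:

```python
import numpy as np

# Hypothetical per-example scores from the selection criterion
# (a random stand-in here; the real rule would go in its place).
scores = np.random.rand(100)

# Sort by the criterion and keep the 10 "best" examples.
sorted_indices = np.argsort(scores)[:10]

# Save the indices once; later runs recover the exact same subset.
np.save("subset_indices.npy", sorted_indices)
loaded = np.load("subset_indices.npy")
```

The loaded indices can then be passed straight to SubsetRandomSampler.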


(Artur Lacerda) #6

SubsetRandomSampler restricts the DataLoader to the list of indices you pass. Otherwise everything is the same: the dataloader generates batches from these indices without replacement and in random order. This is useful, for example, when you want to split your dataset into train/validation: you can split the indices and build two data loaders, each restricted to its own subset.
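The train/validation split described above can be sketched like this, with synthetic tensors standing in for CIFAR-10 (names and the 80/20 split are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Synthetic dataset standing in for CIFAR-10.
data = torch.randn(100, 3, 32, 32)
targets = torch.randint(0, 10, (100,))
dataset = TensorDataset(data, targets)

# Split the indices 80/20, then restrict one loader to each subset.
perm = torch.randperm(len(dataset)).tolist()
train_idx, val_idx = perm[:80], perm[80:]
train_loader = DataLoader(dataset, batch_size=32, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=32, sampler=SubsetRandomSampler(val_idx))
```

Each loader only ever sees its own indices, but still shuffles them every epoch.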

And I believe you can do what you want with SequentialSampler (assuming it is important not to shuffle). Generate the train/test indices using your rule and build two data loaders.


(MirandaAgent) #7

The thing is that during training I do want the batches to be produced randomly, but I don’t want my subset to be chosen randomly; I want it to be created according to the criterion I developed. How does that change things? Should I use SequentialSampler or RandomSampler? This second sampler threw me off a bit.


(Artur Lacerda) #8

Maybe something like this:

import torch
import torchvision
from torch.utils.data.sampler import SubsetRandomSampler

cifar_dataset = torchvision.datasets.CIFAR10(root='./data', transform=transform)
train_indices = # select train indices according to your rule
test_indices = # select test indices according to your rule
train_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(train_indices))
test_loader = torch.utils.data.DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(test_indices))

(As before, shuffle must not be set together with a sampler; the sampler handles the shuffling.)

If your rule can be computed quickly, you don’t even have to bother saving it to a file. Btw, CIFAR10 has a standard train/test split, and the snippet above splits the train set further. Do you want to ignore the original split, or is using just the train set fine?


(MirandaAgent) #9

oh no, I just realized my criterion is sort of expensive to compute: it requires changing the labels according to my criterion… I think I’m going to have to create and save a new dataset, so that I change the labels according to this criterion only once…
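One way to do that: apply the relabeling once, save the result with torch.save, and rebuild a TensorDataset from the file on later runs. The relabeling rule and filename below are trivial stand-ins for the real (expensive) criterion:

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in data and a placeholder relabeling rule
# (the real, expensive criterion would go here).
data = torch.randn(50, 3, 32, 32)
orig_labels = torch.randint(0, 10, (50,))
new_labels = (orig_labels + 1) % 10

# Save the relabeled dataset so the criterion is only computed once.
torch.save({"data": data, "labels": new_labels}, "relabeled_cifar.pt")

# Later runs just load it back and wrap it in a Dataset.
blob = torch.load("relabeled_cifar.pt")
dataset = TensorDataset(blob["data"], blob["labels"])
```

The resulting dataset works with any DataLoader/sampler combination from the snippets above.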


(MirandaAgent) #10

actually Artur, how do you get any indices?


How does one obtain indices from a dataloader?
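The thread ends here, but one common pattern for this (not from the answers above) is a wrapper Dataset that returns each sample's index alongside the data, so the indices come out of the DataLoader with every batch:

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset

class IndexedDataset(Dataset):
    """Wraps a dataset so each item also returns its own index."""
    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        x, y = self.base[i]
        return x, y, i

# Synthetic stand-in dataset.
base = TensorDataset(torch.randn(20, 3), torch.randint(0, 2, (20,)))
loader = DataLoader(IndexedDataset(base), batch_size=5, shuffle=True)

# Every batch now carries the indices of its samples.
seen = []
for x, y, idx in loader:
    seen.extend(idx.tolist())
```

Over one epoch, the collected indices cover the whole dataset exactly once.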