How does one create a data set in pytorch and save it into a file to later be used?


(MirandaAgent) #1

I want to extract data from CIFAR10 in a specific order according to some criterion f(Image, label) (for the sake of an example, let's say f(Image, label) simply computes the sum of all the pixels in Image). Then I want to generate one file for the train set and one file for the test set that I can later load in a DataLoader to use for training a neural net.

How do I do this? My current idea was simply to loop through the data with a DataLoader with shuffle off, remember the index and score of each image, sort the indices according to the score, then loop through everything again, build one giant numpy array per split, and save it. After saving, I'd wrap each split with torch.utils.data.TensorDataset(X_train, y_train) (data and targets, not the two splits) and feed it to a DataLoader.

I think it might work for a small data set like cifar10 at the very least, right?


Another very important thing for me: I also want to train on only the first K images (since the data is already sorted, the first K have a special meaning I want to keep), so preserving the order while training on only a fraction of the dataset will be important.
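A minimal sketch of the giant-array approach described above. It uses random tensors as a stand-in for CIFAR10 (so the snippet runs without a download); the file name `cifar10_sorted_train.pt` and the pixel-sum criterion are just the example from the question:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in for CIFAR10: 100 random 3x32x32 images with integer labels.
# In practice, iterate over torchvision.datasets.CIFAR10 instead.
images = torch.rand(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))

# Criterion f(image, label): sum of all pixels in the image.
scores = images.view(100, -1).sum(dim=1)

# Sort by score, reorder the data, and save the train split to one file.
order = torch.argsort(scores)
X_train, y_train = images[order], labels[order]
torch.save({'X': X_train, 'y': y_train}, 'cifar10_sorted_train.pt')

# Later: load the file and wrap it in a TensorDataset (data and targets
# go into the same TensorDataset).
blob = torch.load('cifar10_sorted_train.pt')
dataset = TensorDataset(blob['X'], blob['y'])
loader = DataLoader(dataset, batch_size=32, shuffle=False)
```

With `shuffle=False` the loader yields batches in sorted order, so slicing to the first K examples keeps their special meaning.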


Train on a fraction of the data set
(MirandaAgent) #2

Using Artur’s solution:

Just sort the indices according to the criterion I have, save those indices, and recover them from a file whenever I need them. Then use the data sampler!
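A sketch of that index-only approach, again with random tensors standing in for CIFAR10 and a hypothetical file name `sorted_indices.pt`. Only the indices are written to disk, so the image data is never duplicated:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Stand-in dataset; replace with torchvision.datasets.CIFAR10 in practice.
images = torch.rand(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# Score every example with the criterion, then save the sorted indices only.
scores = torch.stack([img.sum() for img, _ in dataset])
sorted_indices = torch.argsort(scores).tolist()
torch.save(sorted_indices, 'sorted_indices.pt')

# Recover the indices later and hand them to a sampler.
indices = torch.load('sorted_indices.pt')
loader = DataLoader(dataset, batch_size=32,
                    sampler=SubsetRandomSampler(indices))
```

SubsetRandomSampler draws only from the given indices but shuffles within them each epoch, which is usually what you want for training.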


(MirandaAgent) #3
from torch.utils.data import DataLoader, SubsetRandomSampler
import torchvision

cifar_dataset = torchvision.datasets.CIFAR10(root='./data', transform=transform)
train_indices = ...  # select train indices according to your rule
test_indices = ...   # select test indices according to your rule
# Note: shuffle must be left off. DataLoader raises an error when both
# shuffle and sampler are given, and SubsetRandomSampler already shuffles
# within the supplied indices.
train_loader = DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(train_indices))
test_loader = DataLoader(cifar_dataset, batch_size=32, sampler=SubsetRandomSampler(test_indices))
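To train on only the first K sorted images, slice the index list before building the sampler. A self-contained sketch with a toy stand-in dataset and a hypothetical K:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Toy stand-in for the CIFAR10 dataset used above.
dataset = TensorDataset(torch.rand(100, 3, 32, 32),
                        torch.randint(0, 10, (100,)))
sorted_indices = list(range(100))  # indices already sorted by the criterion

K = 20  # hypothetical: train on the first K sorted images only
train_loader = DataLoader(dataset, batch_size=32,
                          sampler=SubsetRandomSampler(sorted_indices[:K]))

# Exactly K examples are seen per epoch, shuffled within the first K.
n_seen = sum(x.shape[0] for x, _ in train_loader)
```

Because the sampler only ever sees `sorted_indices[:K]`, the fraction respects the sorted order while still randomizing batch composition within it.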

(MirandaAgent) #4