How to get a part of a dataset?

Sorry for my poor English.
This is my code:
trainset = datasets.MNIST('data', train=True, download=False, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True)

Now I want to choose a part of the training set (e.g. 3000 images and labels) from the shuffled dataset every epoch. I want to know how to shuffle the dataset and then pick the indices 0 to 2999. Please help me.

You could manually shuffle the indices using:

indices = torch.randperm(len(trainset))[:3000]

and pass these indices to a SubsetRandomSampler, which can then be passed to the DataLoader.
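Putting it together, a rough (untested) sketch using your trainset:

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler
from torchvision import datasets, transforms

transform = transforms.ToTensor()
trainset = datasets.MNIST('data', train=True, download=False, transform=transform)

# draw 3000 random indices; recreate them each epoch if you want a fresh subset
indices = torch.randperm(len(trainset))[:3000]
sampler = SubsetRandomSampler(indices)

# note: sampler is mutually exclusive with shuffle=True
trainloader = DataLoader(trainset, batch_size=32, sampler=sampler)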


Thanks for your help

Sorry to disturb you.
There are 10 classes in the MNIST dataset (0, 1, 2, 3, ...). Now I want to choose 100 samples from each class, build a new 'trainset' from them, and then use it as follows:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)
Can you help me?

If you are fine with approximately 100 randomly drawn samples per class, this code should work:

# Setup
import numpy as np
import torch
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import train_test_split

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5,), (0.5,))])

dataset = torchvision.datasets.MNIST('./data/', train=True, transform=transform)

# Split the indices in a stratified way
indices = np.arange(len(dataset))
train_indices, test_indices = train_test_split(indices, train_size=100*10, stratify=dataset.targets)

# Wrap into Subsets and DataLoaders
train_dataset = Subset(dataset, train_indices)
test_dataset = Subset(dataset, test_indices)

train_loader = DataLoader(train_dataset, shuffle=True, num_workers=2, batch_size=10)
test_loader = DataLoader(test_dataset, shuffle=False, num_workers=2, batch_size=10)


# Validation
train_targets = []
for _, target in train_loader:
    train_targets.append(target)
train_targets = torch.cat(train_targets)

print(train_targets.unique(return_counts=True))
> (tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), tensor([ 99, 112,  99, 102,  97,  90,  99, 105,  98,  99]))

Otherwise, you could use a loop over each class to get exactly 100 random indices per class and then use the same Subset approach, as sketched below.
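A rough sketch of that loop (assuming dataset.targets is a tensor, as in torchvision's MNIST):

import torch
from torch.utils.data import Subset

indices = []
for c in range(10):
    # indices of all samples belonging to class c
    class_idx = (dataset.targets == c).nonzero(as_tuple=True)[0]
    # pick exactly 100 random samples of this class
    indices.append(class_idx[torch.randperm(len(class_idx))[:100]])
trainset = Subset(dataset, torch.cat(indices).tolist())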


Thanks, and sorry to take up your time.

@ptrblck I’d like to access the classes of the dataloader, but currently have to switch between loader.dataset.classes and loader.dataset.dataset.classes, depending on whether I used Subset or not. Is there a way around this?

I don’t know what the best approach would be besides checking whether loader.dataset is an instance of Subset.
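E.g. something like:

from torch.utils.data import Subset

ds = loader.dataset
classes = ds.dataset.classes if isinstance(ds, Subset) else ds.classes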

@ptrblck Thanks. Currently, I’ve written the following utility function:

def get_classes(dataset):
    # unwrap nested Subsets until we reach the base dataset
    while hasattr(dataset, 'dataset'):
        dataset = dataset.dataset
    return dataset.classes

From an API standpoint, might it not be better for Subset to inherit classes, targets (properly filtered according to the indices), and perhaps other relevant attributes from its parent?

That’s not necessarily easy if your original Dataset loads the data lazily.
Currently, Subset only forwards the passed indices to the underlying Dataset.
This works fine, since Subset has no knowledge about the Dataset and just acts as a “filter” for the indices.
If you want to forward dataset internals such as .classes, .targets, etc., Subset would need to know which kind of Dataset you are using and which attributes to expose.
The underlying dataset can of course be a custom Dataset that doesn’t provide these attributes at all, as they might be unknown during initialization.
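That being said, if you know your datasets do provide these attributes, you could write a small custom wrapper yourself. A hypothetical sketch (assuming the wrapped dataset exposes .classes and a tensor .targets):

from torch.utils.data import Subset

class SubsetWithAttrs(Subset):
    # only works if the wrapped dataset actually exposes these attributes
    @property
    def classes(self):
        return self.dataset.classes

    @property
    def targets(self):
        # filter the targets by the subset indices
        return self.dataset.targets[self.indices]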