How to load only 10% of ImageNet from datasets.ImageFolder

Hi everyone,

I want to select only 10% of ImageNet, which I load through datasets.ImageFolder as follows:

image_datasets = {x:  datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}

I know I should use torch.utils.data.Subset, but I am not sure how to feed image_datasets to it, especially because train and val are not the same size. I would appreciate any suggestions.

Subset accepts a dataset and the desired indices.
Assuming you want to use 10% of your original dataset, something like this should work:

import torch
from torch.utils.data import TensorDataset

dataset = TensorDataset(torch.randn(100, 1), torch.randint(0, 1000, (100,)))

subset_ratio = 0.1
nb_samples = len(dataset)
# shuffle all indices and keep the first subset_ratio fraction of them
indices = torch.randperm(nb_samples)[:int(subset_ratio*nb_samples)]
print(indices)
# tensor([60, 66, 22, 11, 32, 83, 29, 50, 81, 90])

subset = torch.utils.data.Subset(dataset, indices=indices)
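
If you need the same subset on every run, the snippet above can be wrapped in a small helper. This is only a sketch: random_subset is a hypothetical name, and seeding through a torch.Generator is my assumption about what "reproducible" should mean here.

```python
import torch
from torch.utils.data import Subset, TensorDataset


def random_subset(dataset, ratio, seed=0):
    """Return a Subset holding a random `ratio` fraction of `dataset`.

    `random_subset` and `seed` are illustrative names, not part of PyTorch.
    """
    generator = torch.Generator().manual_seed(seed)
    nb_samples = len(dataset)
    nb_keep = int(ratio * nb_samples)
    # Shuffle all indices, then keep the first nb_keep of them.
    indices = torch.randperm(nb_samples, generator=generator)[:nb_keep]
    return Subset(dataset, indices.tolist())


# Stand-in data; a datasets.ImageFolder instance would be passed the same way.
dataset = TensorDataset(torch.randn(100, 1), torch.randint(0, 1000, (100,)))
subset = random_subset(dataset, ratio=0.1)
print(len(subset))
```

Calling the helper twice with the same seed should give the same indices, which helps when comparing runs.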

Thanks for the response, @ptrblck.
The problem is that when I use Subset, the DataLoader raises an error.

I want to modify this code:

image_datasets = {x:  datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=args.batch_size,
                        shuffle=True, num_workers=2 ) for x in ['train', 'val']}

I want to change it so that only 10% of train and 10% of val are used. I used Subset as follows:

image_datasets = {x:  datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}

image_train = data_utils.Subset(image_datasets['train'], 60000)
image_val = data_utils.Subset(image_datasets['val'], 2500)

dataloaders_train = {x: torch.utils.data.DataLoader(image_train[x], batch_size=args.batch_size, shuffle=True, num_workers=2 ) for x in ['train']}
dataloaders_val = {x: torch.utils.data.DataLoader(image_val[x], batch_size=args.batch_size,shuffle=True, num_workers=2 ) for x in ['val']}

but this gives an error,

TypeError: 'int' object is not subscriptable

I am not sure what I am missing here. Any suggestions?

Subset expects a sequence of indices as its second argument, not a sample count. You can index the image_datasets dict beforehand, wrap each dataset in a Subset, and then pass the subsets to a DataLoader.
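
For context, the TypeError most likely comes from handing Subset an integer: it stores the value as-is and only indexes into it when a sample is requested. A minimal sketch with a stand-in TensorDataset (the sizes are made up for illustration):

```python
import torch
from torch.utils.data import Subset, TensorDataset

dataset = TensorDataset(torch.randn(100, 1))

# An int is accepted at construction time but fails on access,
# because Subset looks up dataset[indices[idx]] internally.
bad_subset = Subset(dataset, 10)
try:
    bad_subset[0]
except TypeError as err:
    error_message = str(err)
    print(error_message)

# A sequence of indices works as intended.
good_subset = Subset(dataset, list(range(10)))
print(len(good_subset))
```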

Can you please explain more? How should I do that?

Here is an example:

import torch
from torch.utils.data import TensorDataset

image_datasets = {x: TensorDataset(torch.randn(100, 1)) for x in ["train", "val"]}
lens = {x: len(image_datasets[x]) for x in ["train", "val"]}
# shuffle all indices of each split, then keep the first 10% (sampling from the whole split, not just its start)
image_datasets = {x: torch.utils.data.Subset(image_datasets[x], indices=torch.randperm(lens[x])[:int(lens[x]*0.1)]) for x in ["train", "val"]}

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=10, shuffle=True, num_workers=2) for x in ["train", "val"]}
dataloader_train = dataloaders["train"]
dataloader_val = dataloaders["val"]
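
Since train and val differ in size, a quick sanity check can confirm that each split is reduced to roughly 10% of its own length. This is a sketch with stand-in data: the split sizes 1000 and 50 are made up, and the loaders use the default num_workers to keep the check lightweight.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in splits with deliberately different sizes (hypothetical numbers).
sizes = {"train": 1000, "val": 50}
full = {x: TensorDataset(torch.randn(n, 1)) for x, n in sizes.items()}

subsets = {}
for x, ds in full.items():
    nb_keep = int(len(ds) * 0.1)
    # Shuffle the indices of this split and keep the first 10% of them.
    indices = torch.randperm(len(ds))[:nb_keep].tolist()
    subsets[x] = Subset(ds, indices)

loaders = {x: DataLoader(subsets[x], batch_size=10, shuffle=True) for x in subsets}

for x in subsets:
    # Count the samples actually yielded by the loader for this split.
    nb_seen = sum(batch[0].shape[0] for batch in loaders[x])
    print(x, len(full[x]), len(subsets[x]), nb_seen)
```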

It worked. Thanks a lot @ptrblck!