How do I sample images from a dataset, keeping only classes whose image count meets a specified threshold?

I am working with a dataset of nearly 5200 classes of images and around 14000 images in total. Some of these classes have only one image each, and I want to use only the classes that have at least, say, 3 or more images. One way would be to iterate over the entire directory and copy the images satisfying the condition to a new directory. However, the dataset is in the standard ImageFolder layout, so I wanted to know: is there a PyTorchic way to sample images from the dataset while creating the dataloaders, i.e. doing it directly via PyTorch dataloaders or samplers instead of making a new directory?
If anyone can suggest a possible solution, thanks in advance.

@ptrblck_de can you have a look at it?

You could access the .targets attribute of the ImageFolder directly and check the number of occurrences of each unique target via torch.unique(torch.tensor(dataset.targets), return_counts=True) (.targets is a plain Python list, so wrap it in a tensor first). Once you know which targets you want to keep, you could collect the corresponding sample indices and pass them to a Subset, which would then only return the wanted class samples.


@ptrblck Okay, that's done, but there's one more thing I am confused about. As per the PyTorch docs, torch.unique() returns the unique elements and, optionally, the inverse indices and the counts. How do I get the indices of the classes I want to keep? Suppose I want to enforce the constraint that the image count per class should be >=3; how do I get the indices satisfying that condition?

Here is an example of my suggestion:

# create dummy targets and count the samples per class
torch.manual_seed(0)
targets = torch.randint(0, 5, (10,))
u, c = torch.unique(targets, return_counts=True)
print(u, c)
> tensor([0, 2, 3, 4]) tensor([1, 2, 4, 3])

# keep only the classes with at least 3 samples
classes_to_keep = u[c >= 3]
print(classes_to_keep)
> tensor([3, 4])

# gather the sample indices belonging to the kept classes
idx = []
for class_to_keep in classes_to_keep:
    idx.extend((targets == class_to_keep).nonzero(as_tuple=True)[0].tolist())
idx = torch.tensor(idx)
print(idx)
> tensor([2, 4, 7, 9, 0, 1, 5])

print(targets[idx])
> tensor([3, 3, 3, 3, 4, 4, 4])

You could then use the idx in a Subset.
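
For completeness, a minimal sketch of that last step, assuming the dataset and idx from above:

# wrap the original dataset so only the kept sample indices are visible
subset = torch.utils.data.Subset(dataset, idx.tolist())
loader = torch.utils.data.DataLoader(subset, batch_size=32, shuffle=True)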


Thanks a lot @ptrblck, it solved my problem.

@ptrblck Using your suggestions I was trying to create a reduced dataset for a problem at hand. However, I keep encountering a CUDA runtime error: device-side assert triggered for the following code. I first use ImageFolder to get my dataset, then reduce it based on the threshold of 3, followed by a stratified split of the paths and labels to get my train, val, and test sets, which I finally use for my model. Do you see any error in this snippet? I also think it can be optimized, but I am not sure how; can you provide suggestions for that as well?


data_dir = original_cropped_dir_path
threshold = 3

transformations = T.Compose([
    T.Resize(size=256),
    T.ToTensor(),
    fixed_image_standardization
])

# build the full dataset, then filter out classes below the threshold
dataset = datasets.ImageFolder(data_dir, transform=transformations)
targets = torch.tensor(dataset.targets)
unique_classes, counts = torch.unique(targets, return_counts=True)
classes_to_keep = unique_classes[counts >= threshold]

# collect the sample indices of the kept classes
indices = []
for class_to_keep in classes_to_keep:
    indices.extend((targets == class_to_keep).nonzero(as_tuple=True)[0].tolist())
print(len(indices))

# reduced labels and image paths
re_targets = targets[indices].tolist()
re_img_paths = [dataset.imgs[i][0] for i in indices]
print(len(re_targets), len(re_img_paths))

# stratified train/val/test split on the reduced paths and labels
from sklearn.model_selection import train_test_split
t_paths, test_paths, t_labels, test_labels = train_test_split(
    re_img_paths, re_targets, stratify=re_targets, test_size=0.2)
train_paths, val_paths, train_labels, val_labels = train_test_split(
    t_paths, t_labels, stratify=t_labels, test_size=0.2)

batch_size = 32
num_workers = 0 if os.name == 'nt' else 2

class LFWDataset(Dataset):
    def __init__(self, img_paths, labels, transforms):
        super(LFWDataset, self).__init__()
        self.img_paths = img_paths
        self.labels = labels
        self.transforms = transforms

    def __getitem__(self, idx):
        path = self.img_paths[idx]
        img = Image.open(path).convert('RGB')
        img = self.transforms(img)
        label = torch.tensor(self.labels[idx])
        return img, label

    def __len__(self):
        return len(self.labels)

train_dataset = LFWDataset(train_paths, train_labels, transformations)
val_dataset = LFWDataset(val_paths, val_labels, transformations)
test_dataset = LFWDataset(test_paths, test_labels, transformations)

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, drop_last=True, num_workers=num_workers)
validation_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, drop_last=True, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=batch_size, drop_last=True, num_workers=num_workers)

I guess you might be keeping N classes whose labels are not the contiguous class indices [0, N-1] but larger values, which would create errors while trying to calculate the loss in e.g. nn.CrossEntropyLoss.
This loss function expects the model output in the shape [batch_size, nb_classes, *] and a target tensor in the shape [batch_size, *] containing class indices in [0, nb_classes-1].
If you are e.g. keeping the class indices [3, 4, 5], you would have to map them to [0, 1, 2].
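
A minimal sketch of this failure mode (with made-up shapes):

import torch
import torch.nn as nn

# nb_classes = 3, but one target value lies outside [0, 2]
criterion = nn.CrossEntropyLoss()
output = torch.randn(4, 3)
target = torch.tensor([0, 1, 2, 4])
loss = criterion(output, target)
# on the CPU this raises an out-of-bounds IndexError; on the GPU the same
# issue surfaces as the "device-side assert triggered" error you are seeing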

@ptrblck thanks, yes you are right, that must be the case. Any suggestions for tackling this?

You would have to map these indices to [0, nb_classes-1] as described in the previous post.

EDIT: here is a code snippet in case you get stuck:

# dummy targets with non-contiguous class indices in [3, 5]
target = torch.randint(3, 6, (10,))
print(target)
> tensor([3, 4, 3, 3, 3, 5, 5, 4, 3, 4])

# remap to [0, nb_classes-1]; torch.unique returns sorted values,
# so each in-place write cannot collide with a later unique value
unique = torch.unique(target)
for i, u in enumerate(unique):
    target[target == u] = i
print(target)
> tensor([0, 1, 0, 0, 0, 2, 2, 1, 0, 1])
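
Applied to your snippet, the same remapping could be done on re_targets right before the stratified split, e.g. with a lookup dict (label_map is a name I am introducing here, not something from your code):

# hypothetical helper: remap the filtered labels to [0, nb_classes-1] before splitting
label_map = {old: new for new, old in enumerate(sorted(set(re_targets)))}
re_targets = [label_map[t] for t in re_targets]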

Thanks @ptrblck, it solves the problem. Any suggestions for optimizing the data flow? I am using the ImageFolder first, which already takes time, followed by this threshold-based reduction of the dataset.

If the dataset is static and you are using the same labels every time, you could create a new root folder (for the ImageFolder) and create symbolic links to the desired class folders in this new root.
The ImageFolder will then use only these classes and will assign them the right class indices.
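
A minimal sketch of that idea, with hypothetical src_root/dst_root paths and the threshold of 3:

import os

src_root = '/path/to/original_cropped'   # hypothetical: current ImageFolder root
dst_root = '/path/to/reduced_root'       # hypothetical: new root with only the kept classes
os.makedirs(dst_root, exist_ok=True)

for class_name in sorted(os.listdir(src_root)):
    class_dir = os.path.join(src_root, class_name)
    # link only the class folders that meet the threshold
    if os.path.isdir(class_dir) and len(os.listdir(class_dir)) >= 3:
        os.symlink(class_dir, os.path.join(dst_root, class_name))

Since the new root only contains the kept classes, the ImageFolder will assign them contiguous class indices automatically, so the remapping step above is not needed anymore.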