Using only some of the subfolders in a dataset

Blueberry · March 6, 2020, 5:23pm

I have a dataset folder that consists of four subfolders (say f1, f2, f3, f4), each containing images from a different class. How can I read only three specified folders, say f1, f2, and f4, for an image classification problem? Currently I am reading all the folders by default using:

data = datasets.ImageFolder(train_dir,transform=transform)
train_loader = torch.utils.data.DataLoader(data,batch_size=batch_size,sampler=train_sampler)

ptrblck · March 7, 2020, 5:13am

You could create a new folder with symbolic links to the desired 3 folders or alternatively move the unwanted folder to another place.

Blueberry · March 7, 2020, 9:28am

@ptrblck I am working with a huge Kaggle dataset, so I wanted to know whether there is a way to read only three of the four class subfolders in PyTorch.

ptrblck · March 8, 2020, 1:41am

Symbolic links won’t move any data. Would that work for you?

Blueberry · March 9, 2020, 11:48am

@ptrblck Could you please elaborate on what you mean by that and how it can be done?

ptrblck · March 9, 2020, 11:21pm

ln -s source_dir destination_dir will create a symbolic link of the source directory (or file) to the specified destination without moving any files.
If you would like to use only a few subfolders, you could create a new root folder and then create symbolic links to the desired image folders inside the new root folder.
This will make sure that ImageFolder only sees the desired class folders instead of all.

PS: Assuming you use a Linux OS.
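
For anyone who would rather do this from Python than from the shell, here is a minimal sketch of the same idea using the standard library's os.symlink. The helper name link_class_folders and the folder names in the usage note are hypothetical, not part of torchvision. On Windows, os.symlink may need Developer Mode or administrator rights.

```python
import os


def link_class_folders(src_root, dst_root, keep):
    """Create dst_root containing symbolic links to the chosen class folders.

    No image data is copied or moved; only links are created.
    """
    os.makedirs(dst_root, exist_ok=True)
    for name in keep:
        src = os.path.abspath(os.path.join(src_root, name))
        if not os.path.isdir(src):
            raise FileNotFoundError(f"Class folder not found: {src}")
        dst = os.path.join(dst_root, name)
        if not os.path.lexists(dst):  # skip links that already exist
            os.symlink(src, dst, target_is_directory=True)
    return dst_root
```

Something like link_class_folders(train_dir, "train_subset", ["f1", "f2", "f4"]) would then give a new root that can be passed to datasets.ImageFolder in place of train_dir. The new root can live anywhere you have write access, so the original data folder is left untouched.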

saba · November 9, 2020, 3:09am

Hi Ptrblck,

I am facing the same problem. /home/work/ contains many folders, and inside dataroot = '/home/work/radius_32_32/' I have 4 more folders that contain the data I want to use. Where should I add the link line in the code so that only radius_32_32 and its subfolders are used?

ln -s /home/work/ /home/work/radius_32_32

dataroot = '/home/work/radius_32_32/'

dataset = dset.ImageFolder(root=dataroot,
                            transform=transforms.Compose([
                                transforms.Resize(image_size),
                                transforms.CenterCrop(image_size),
                                transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                            ]))

dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         shuffle=True, num_workers=workers)

ptrblck · November 9, 2020, 5:20am

You don’t need to create symbolic links and can just pass /home/work/radius_32_32 to your ImageFolder, if you want to use this folder as the root and the subfolders as the class folders.

Shyam_Gupta196 · June 19, 2021, 1:03pm

Hi @ptrblck,
can you suggest something for Windows?
I am using the cats vs. dogs dataset, and it turns out I only want to use the dogs subfolder.

What should I do so that ImageFolder only sees the dog folder?

My data is read-only and cannot be changed,
so I cannot move it to another folder.

Please help me out.

ptrblck · June 19, 2021, 7:46pm

I believe Windows can also create symbolic links (they might have a different name?) so you should be able to use the same workflow in creating a new folder with the sym links to the desired data folder(s).

Kaykay · September 10, 2021, 6:12pm

You can make your batch_size=1 and skip everything other than specific classes during the training loop. Maybe it’s not the most efficient but it’s very easy:

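import os
from tqdm import tqdm

# datadir, epochs, and trainloader (built with batch_size=1) are defined elsewhere
classes: list[str] = sorted(os.listdir(datadir))
# indices of the classes to keep, in the same sorted order ImageFolder uses
targets: list[int] = [classes.index(label) for label in classes if label in ('Dog','Cat')]

for epoch in tqdm(range(epochs)):
	for images, labels in tqdm(trainloader):
		# skip every batch whose label is not one of the wanted classes
		if not any(x in labels for x in targets):
			continue
		...  # training step goes here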
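
A possible variation on this, sketched here rather than taken from the thread, is to pick out the wanted samples once up front instead of skipping them inside the training loop. The helper name indices_for_classes is hypothetical, and the toy lists below are made-up stand-ins for the samples and class_to_idx attributes of an ImageFolder.

```python
def indices_for_classes(samples, class_to_idx, keep):
    """Return positions of samples whose label belongs to one of the kept classes.

    samples: list of (path, class_index) pairs, like ImageFolder.samples
    class_to_idx: dict mapping class name to index, like ImageFolder.class_to_idx
    keep: iterable of class names to keep
    """
    wanted = {class_to_idx[name] for name in keep}
    return [i for i, (_, label) in enumerate(samples) if label in wanted]


# Toy stand-ins for dataset.samples and dataset.class_to_idx (made-up paths).
samples = [("f1/a.jpg", 0), ("f2/b.jpg", 1), ("f3/c.jpg", 2), ("f4/d.jpg", 3)]
class_to_idx = {"f1": 0, "f2": 1, "f3": 2, "f4": 3}
print(indices_for_classes(samples, class_to_idx, ["f1", "f2", "f4"]))  # [0, 1, 3]
```

With a real dataset, the resulting list could be passed to torch.utils.data.Subset(data, indices) and the subset handed to the DataLoader with any batch size. Note that the labels keep their original indices (f4 is still 3 here), so a model with only three outputs would need the labels remapped, for example through target_transform.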

danK · December 10, 2021, 2:25pm

I had a very similar issue to the original post where I wanted to ignore certain classes within my root directory. After some tinkering I came up with the following solution that works well for me. The solution is based on subclassing the DatasetFolder class and overriding the find_classes method with a slight modification that ignores all folders specified in dropped_classes. The class can be interchangeably used with the standard ImageFolder class:

import os  # needed for os.scandir in find_classes

from torchvision import datasets
IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp')

class ClassSpecificImageFolder(datasets.DatasetFolder):
    def __init__(
            self,
            root,
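            dropped_classes=None,  # names of class folders to ignore
            transform = None,
            target_transform = None,
            loader = datasets.folder.default_loader,
            is_valid_file = None,
    ):
        # copy into a fresh list to avoid sharing a mutable default between instances
        self.dropped_classes = list(dropped_classes or [])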
        super(ClassSpecificImageFolder, self).__init__(root, loader, IMG_EXTENSIONS if is_valid_file is None else None,
                                                       transform=transform,
                                                       target_transform=target_transform,
                                                       is_valid_file=is_valid_file)
        self.imgs = self.samples

    def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        classes = [c for c in classes if c not in self.dropped_classes]
        if not classes:
            raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")

        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
        return classes, class_to_idx
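
One consequence worth keeping in mind is that the remaining classes are re-enumerated from zero, so their indices can differ from those a plain ImageFolder would assign. The small sketch below mirrors the filtering logic on a plain list of folder names; the helper name remaining_class_to_idx is hypothetical and only meant for illustration.

```python
def remaining_class_to_idx(all_classes, dropped_classes):
    """Mirror the find_classes logic above on a plain list of folder names."""
    classes = [c for c in sorted(all_classes) if c not in dropped_classes]
    return {cls_name: i for i, cls_name in enumerate(classes)}


print(remaining_class_to_idx(["f1", "f2", "f3", "f4"], ["f3"]))
# {'f1': 0, 'f2': 1, 'f4': 2}
```

Because the labels come out contiguous (0 to 2 here), a model with three outputs can be trained on them directly. The flip side is that f4 maps to 2 instead of 3, which matters if the labels are compared with ones produced from the full folder.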