I have a dataset folder that consists of four sub-folders (say f1, f2, f3, f4), each containing images from a different class. How can I read only three specified folders, say f1, f2, and f4, for an image classification problem? Currently I am reading all the folders by default using:
data = datasets.ImageFolder(train_dir,transform=transform)
train_loader = torch.utils.data.DataLoader(data,batch_size=batch_size,sampler=train_sampler)
@ptrblck I was working on a huge Kaggle dataset, so I wanted to know if there is a way to read only three out of the four class subfolders in PyTorch.
ln -s source_dir destination_dir will create a symbolic link of the source directory (or file) to the specified destination without moving any files.
If you would like to use only a few subfolders, you could create a new root folder and then create symbolic links to the desired image folders inside the new root folder.
This will make sure that ImageFolder only sees the desired class folders instead of all.
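The same symbolic-link approach can also be scripted from Python instead of calling ln -s by hand. The following is only a minimal sketch: the helper name link_selected_classes and its arguments are hypothetical, and it assumes the platform and user permissions allow creating symbolic links.

```python
import os


def link_selected_classes(source_root, new_root, keep):
    """Create new_root and fill it with symlinks to the chosen class folders.

    source_root: existing dataset folder containing one sub-folder per class
    new_root:    new folder that will hold only the links (hypothetical path)
    keep:        names of the class folders to expose, e.g. ["f1", "f2", "f4"]
    """
    os.makedirs(new_root, exist_ok=True)
    for name in keep:
        src = os.path.abspath(os.path.join(source_root, name))
        dst = os.path.join(new_root, name)
        if not os.path.isdir(src):
            raise FileNotFoundError(f"Class folder not found: {src}")
        # Skip links that already exist so the helper can be re-run safely.
        if not os.path.lexists(dst):
            os.symlink(src, dst, target_is_directory=True)
    return new_root
```

The returned folder can then be passed as the root to datasets.ImageFolder in place of the original train_dir, so that only the linked classes are picked up.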
I am facing the same problem. "/home/work/" includes many folders, and in dataroot = '/home/work/radius_32_32/' I have 4 more folders that contain the data I want to use. Where should I add the link line in the code to use only radius_32_32 and the subfolders in it?
You don’t need to create symbolic links and can just pass /home/work/radius_32_32 to your ImageFolder, if you want to use this folder as the root and the subfolders as the class folders.
I believe Windows can also create symbolic links (they might have a different name?) so you should be able to use the same workflow in creating a new folder with the sym links to the desired data folder(s).
You can make your batch_size=1 and skip everything other than specific classes during the training loop. Maybe it’s not the most efficient but it’s very easy:
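import os
from tqdm import tqdm

# Class names in sorted order, matching how ImageFolder assigns indices
# (assumes datadir contains only class folders).
classes: list[str] = sorted(os.listdir(datadir))
targets: list[int] = [classes.index(label) for label in classes if label in ('Dog', 'Cat')]
for epoch in tqdm(range(epochs)):
    for images, labels in tqdm(trainloader):
        # With batch_size=1 each batch holds a single label; skip unwanted classes.
        if not any(x in labels for x in targets):
            continue
        # ... training step for the kept classes goes here ...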
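An alternative to skipping batches inside the loop is to pick the wanted sample indices once, before building the loader. This is only a sketch of that idea: select_class_indices is a hypothetical helper, and it assumes the dataset exposes a list of integer labels and a class-name-to-index mapping, as ImageFolder does through its targets and class_to_idx attributes.

```python
def select_class_indices(targets, class_to_idx, keep):
    """Return the positions of all samples whose class name is in keep.

    targets:      sequence of integer labels, one per sample
    class_to_idx: mapping from class name to integer label
    keep:         class names to retain, e.g. ["f1", "f2", "f4"]
    """
    keep_ids = {class_to_idx[name] for name in keep}
    return [i for i, t in enumerate(targets) if t in keep_ids]
```

With data = datasets.ImageFolder(...), the result of select_class_indices(data.targets, data.class_to_idx, ['f1', 'f2', 'f4']) could be passed to torch.utils.data.Subset(data, indices), which keeps any batch size usable. Note that the labels keep their original indices with this approach (f4 would still be 3), so they may need remapping before being fed to a loss that expects contiguous class ids.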
I had a very similar issue to the original post where I wanted to ignore certain classes within my root directory. After some tinkering I came up with the following solution that works well for me. The solution is based on subclassing the DatasetFolder class and overriding the find_classes method with a slight modification that ignores all folders specified in dropped_classes. The class can be interchangeably used with the standard ImageFolder class:
import os
from torchvision import datasets

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp')


class ClassSpecificImageFolder(datasets.DatasetFolder):
    """ImageFolder variant that ignores the class folders listed in dropped_classes."""

    def __init__(
        self,
        root,
        dropped_classes=None,
        transform=None,
        target_transform=None,
        loader=datasets.folder.default_loader,
        is_valid_file=None,
    ):
        # Must be set before super().__init__(), which calls find_classes().
        # A None default avoids sharing one mutable list between instances.
        self.dropped_classes = list(dropped_classes) if dropped_classes is not None else []
        super().__init__(
            root,
            loader,
            IMG_EXTENSIONS if is_valid_file is None else None,
            transform=transform,
            target_transform=target_transform,
            is_valid_file=is_valid_file,
        )
        self.imgs = self.samples

    def find_classes(self, directory):
        # Same as the default lookup, minus the folders in dropped_classes.
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        classes = [c for c in classes if c not in self.dropped_classes]
        if not classes:
            raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
        return classes, class_to_idx
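For the original question (keep f1, f2, and f4 rather than drop f3), the filtering step in find_classes can be inverted into an allow-list. The function below is a standalone sketch of that lookup logic only: find_kept_classes and its kept_classes argument are hypothetical names, and it does not depend on torchvision.

```python
import os


def find_kept_classes(directory, kept_classes):
    """Return (classes, class_to_idx) for the listed sub-folders only."""
    kept = set(kept_classes)
    # Only directories whose name is on the allow-list count as classes.
    classes = sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_dir() and entry.name in kept
    )
    if not classes:
        raise FileNotFoundError(f"Couldn't find any of the requested class folders in {directory}.")
    # Indices are contiguous over the kept classes, e.g. f1 -> 0, f2 -> 1, f4 -> 2.
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx
```

The same body could presumably replace the filtering line in the find_classes override above, with the allow-list stored on the instance in the same way as dropped_classes.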