How to use ImageFolder with list of images for train and test sets?

ahsanfarooqui · May 4, 2021, 6:00pm

I am working on the Stanford Dogs dataset, which provides 120 classes of images, a list of 12000 image names for training, and a list of more than 8000 names for testing.

How can I use ImageFolder to load the train and test sets based on these image names? Below is what the lists look like.

Train Set:

       [array(['n02085620-Chihuahua/n02085620_4441.jpg'], dtype='<U38')],
       [array(['n02085620-Chihuahua/n02085620_1502.jpg'], dtype='<U38')],
       ...,
       [array(['n02116738-African_hunting_dog/n02116738_6754.jpg'], dtype='<U48')],
       [array(['n02116738-African_hunting_dog/n02116738_9333.jpg'], dtype='<U48')],
       [array(['n02116738-African_hunting_dog/n02116738_2503.jpg'], dtype='<U48')]],
      dtype=object)

Test Set:

array([[array(['n02085620-Chihuahua/n02085620_2650.jpg'], dtype='<U38')],
       [array(['n02085620-Chihuahua/n02085620_4919.jpg'], dtype='<U38')],
       [array(['n02085620-Chihuahua/n02085620_1765.jpg'], dtype='<U38')],
       ...,
       [array(['n02116738-African_hunting_dog/n02116738_3635.jpg'], dtype='<U48')],
       [array(['n02116738-African_hunting_dog/n02116738_2988.jpg'], dtype='<U48')],
       [array(['n02116738-African_hunting_dog/n02116738_6330.jpg'], dtype='<U48')]],
      dtype=object)

julianolm · May 4, 2021, 6:15pm

I don’t think I have the most efficient answer, but it is the one that worked for me and solved my problem when I was dividing FMD (Flickr Material Database) for doing cross-validation:

You can create two different instances of the same dataset, each with its own transform:

from torchvision import datasets

dataset_train = datasets.ImageFolder(data_dir, transform=train_transforms)
dataset_val = datasets.ImageFolder(data_dir, transform=val_transforms)

and then you subset each one with the indices that you want:

import torch

trainset = torch.utils.data.Subset(dataset_train, train_indices)
valset = torch.utils.data.Subset(dataset_val, val_indices)

For that to work, you just have to create the “train_indices” and “val_indices” lists containing the indices you want in each subset. From then on, you use “trainset” and “valset” to create your dataloaders or to do whatever else you want.
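One way to build those index lists from the file names in the question is sketched below. This is only a sketch under assumptions: the helper names `flatten_file_list` and `indices_for` are made up for illustration, and it relies on the names in the list being relative to the same root directory that was passed to ImageFolder.

```python
import os

def flatten_file_list(file_list):
    """Turn the nested list/array from the question into plain path strings."""
    flat = []
    for entry in file_list:
        # Unwrap one-element containers such as [array(['a/b.jpg'])].
        while not isinstance(entry, str):
            entry = entry[0]
        flat.append(str(entry))
    return flat

def indices_for(samples, root, wanted):
    """Return the indices of `samples` whose path relative to `root` is in `wanted`.

    `samples` is a list of (path, class_index) tuples.
    """
    wanted = {os.path.normpath(p) for p in wanted}
    return [
        i
        for i, (path, _label) in enumerate(samples)
        if os.path.normpath(os.path.relpath(path, root)) in wanted
    ]
```

ImageFolder keeps its (path, class_index) tuples in the `samples` attribute, so something like `train_indices = indices_for(dataset_train.samples, data_dir, flatten_file_list(train_list))` should give the list to pass to Subset. Here `train_list` is a hypothetical name for the training array shown in the question.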

Notes:

I used different transforms for training and validation, as I think is common practice.
I would appreciate any comments and suggestions about how to do it in a better way.

eqy · May 4, 2021, 6:19pm

You don’t need to pass in image names or even provide a list of images when using ImageFolder. It expects images to be in the typical ‘/class1/[images…]’ ‘/class2/[images…]’ organization on disk, so you only need to pass in the directory to create your training and test set (assuming you have separate directories for training and testing).
For example, the root parameter should just be the path to the directory containing each of the class directories.

If you need to manually separate the sets, you can create a new directory structure for each of the classes and move the corresponding images for the test set to the new directories. Separating the images ahead of time is the easiest way to use ImageFolder if you are not doing cross-validation.
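A rough sketch of that manual separation is below. It copies the files instead of moving them so the original download stays intact, which is my own choice here and not something required. The function name `copy_split` is hypothetical, and it assumes each entry in the list looks like 'class_dir/file.jpg' relative to the source directory.

```python
import os
import shutil

def copy_split(src_root, dst_root, relpaths):
    """Copy each 'class_dir/file.jpg' from src_root into dst_root, keeping the class folder."""
    for rel in relpaths:
        dst = os.path.join(dst_root, rel)
        # Create the class directory under the new root if it is missing.
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(os.path.join(src_root, rel), dst)
```

After calling this once with the training names and once with the test names (each with its own destination directory), each destination can be passed to ImageFolder as its root.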