Load a large number of unlabeled images and predict classes

adama · September 21, 2021, 6:48am

I have a fine-tuned model and want to apply it to unlabeled images. The images are located in one directory with several subfolders. Each subfolder can contain further subfolders as well, so I have to collect the images recursively.

I have around 23,000 images to run a binary classification on. I thought it would be more efficient to feed the data to my network through a DataLoader rather than loading one image after another.

If an image is classified as True (1), I want to copy the original image into another folder.
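
The copy step itself needs only the standard library. Here is a minimal sketch, assuming the predictions are already available as one 0/1 value per path. The function name copy_positives and the flat destination folder are illustrative assumptions, not part of the original question.

```python
import os
import shutil


def copy_positives(paths, predictions, destination):
    """Copy every file whose prediction is 1 into `destination`."""
    os.makedirs(destination, exist_ok=True)
    copied = []
    for path, prediction in zip(paths, predictions):
        if prediction == 1:
            # shutil.copy2 keeps the file name and returns the new path.
            copied.append(shutil.copy2(path, destination))
    return copied
```

Because all files land in one flat folder, images with the same file name from different subfolders would overwrite each other. If that can happen, the target name needs to be made unique first.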

To get the data I am using os.walk().

import os

rootpath = '<mypath>'  # placeholder for the root directory
paths = []
for subdir, dirs, files in os.walk(rootpath):
    for file in files:
        # Build the full path of every file found below rootpath.
        filepath = os.path.join(subdir, file)
        paths.append(filepath)
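
Note that os.walk() returns every file, not only images, so a filter on the file extension may be useful before building the dataset. Here is a minimal sketch, assuming the images can be recognized by their extension. The function name collect_image_paths and the extension set are illustrative and not from the original post.

```python
import os

# Assumed set of image extensions; adjust to the files actually present.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def collect_image_paths(rootpath):
    """Recursively collect paths of files whose extension looks like an image."""
    paths = []
    for subdir, dirs, files in os.walk(rootpath):
        for file in files:
            # Compare extensions case-insensitively (e.g. ".JPG").
            if os.path.splitext(file)[1].lower() in IMAGE_EXTENSIONS:
                paths.append(os.path.join(subdir, file))
    # Sort so the order is reproducible between runs.
    return sorted(paths)
```

Sorting is optional, but it keeps the index-to-file mapping stable if the script is run more than once.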

How can I turn this into a PyTorch Dataset that I can feed into a DataLoader?

arya47 · September 21, 2021, 7:28am

This is one of several approaches you can take.

In the __init__ method you can create a structure that stores the paths to all images. Any structure that can be indexed will do,
e.g. a list of image paths ['/img/1.png', '/img/2.png', ...].
Then in the __getitem__ method you can load the image that corresponds to an index x.

(Add parameters to the __init__ function if you want to.)

from torch.utils.data import Dataset, DataLoader

class DataClass(Dataset):
    def __init__(self):
        self.list_of_paths = ...  # Here create a list of all image paths

    def __len__(self):
        return len(self.list_of_paths)

    def __getitem__(self, x):
        image_path = self.list_of_paths[x]  # Gives the path to an image
        # load_image is a placeholder, not a real function: load your image from its path here
        image = load_image(image_path)

        return image

Then create a dataset and a DataLoader:

train_ds = DataClass()
train_dl = DataLoader(train_ds ...)

adama · September 21, 2021, 8:50am

Thank you for the hint.
I wrote the following script:

rootpath = '<mypath>'
paths = []
for subdir, dirs, files in os.walk(rootpath):
    for file in files:
        #print os.path.join(subdir, file)
        filepath = subdir + os.sep + file
        paths.append(filepath)


class ImageDataset(Dataset):
    def __int__(self):
        self.imagelist = paths

    def __len__(self):
        return len(self.imagelist)

    def __getitem__(self, index):
        self.imagepath = self.imagelist[index]
        self.image = Image.open(self.imagepath)

        transform = transforms.Compose([
                    transforms.ToPILImage(),
                    transforms.Resize(256),
                    transforms.CenterCrop(224),
                    transforms.ToTensor(),
                    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
                    ])

        i = transform(self.image)
        return i


imagedataset = ImageDataset()

imagedl = torch.utils.data.DataLoader(imagedataset, batch_size=300, drop_last=True)

for index, img in enumerate(imagedl):
    print(index, img)

And I get the following error when I iterate over the DataLoader:

AttributeError: 'ImageDataset' object has no attribute 'imagelist'

Why? I assign the list of all my image paths to self.imagelist.
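
The error can be reproduced without PyTorch. In the sketch below (the class names are illustrative), the method is named __int__ instead of __init__, so Python does not run it when the object is created and the attribute is never set.

```python
class Misspelled:
    def __int__(self):  # typo: this is the hook used by int(obj), not the constructor
        self.imagelist = ["a.png"]


class Correct:
    def __init__(self):  # runs automatically when the object is created
        self.imagelist = ["a.png"]


print(hasattr(Misspelled(), "imagelist"))  # False
print(hasattr(Correct(), "imagelist"))     # True
```

No error is raised at construction time because the class silently falls back to the inherited __init__, which is why the problem only shows up later in __len__ or __getitem__.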

adama · September 21, 2021, 10:52am

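import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

class CustomDataSet(Dataset):
    def __init__(self, imagelist, transform):
        self.imagepaths = imagelist
        self.transform = transform

    def __len__(self):
        return len(self.imagepaths)

    def __getitem__(self, index):
        # Local variables instead of attributes: nothing has to persist between calls.
        imagepath = self.imagepaths[index]
        # convert("RGB") guards against grayscale or RGBA files, which the 3-channel Normalize cannot handle.
        image = Image.open(imagepath).convert("RGB")
        return self.transform(image)

imagedataset = CustomDataSet(paths, transform)
# drop_last=False so the final, smaller batch is classified as well.
imagedl = torch.utils.data.DataLoader(imagedataset, batch_size=300, drop_last=False)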

Yeah, it can't work if I misspell my __init__ function as __int__. I changed that, and now it's working fine.
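
To copy the positively classified images afterwards, each prediction has to be matched back to its file. One way, sketched below under assumptions, is to let the dataset return the path together with the tensor. The names PathDataset, load_fn and predict_positive_paths are illustrative and not from this thread, and so is the 0.5 threshold on a single sigmoid output. The actual model may produce its output in a different format.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class PathDataset(Dataset):
    """Returns (tensor, path) so each prediction can be traced to its file."""

    def __init__(self, paths, load_fn):
        self.paths = paths
        self.load_fn = load_fn  # hypothetical callable: path -> image tensor

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        path = self.paths[index]
        return self.load_fn(path), path


def predict_positive_paths(model, dataset, batch_size=300, threshold=0.5):
    """Return the paths whose predicted probability exceeds the threshold."""
    # drop_last=False so the final, smaller batch is not skipped.
    loader = DataLoader(dataset, batch_size=batch_size, drop_last=False)
    model.eval()
    positives = []
    with torch.no_grad():  # no gradients are needed for inference
        for images, paths in loader:
            # Assumption: the model outputs one logit per image.
            probabilities = torch.sigmoid(model(images)).flatten()
            for path, probability in zip(paths, probabilities.tolist()):
                if probability > threshold:
                    positives.append(path)
    return positives
```

With the transform from above, load_fn could be something like lambda p: transform(Image.open(p).convert("RGB")). The returned list of paths can then be copied to the target folder.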