Load unlabeled images from a custom Dataset

Hello everyone!
I have a custom dataset with images in specific classes. I have saved this dataset on my computer using folders and subfolders.

  • Train Dataset:
    - 5_1
    - 5_2
    - 5_3
    - etc…

The subfolders (5_1, 5_2, etc.) are the classes of the images. I want to use semi-supervised training, where both labeled and unlabeled images must be used, but I don’t know how to “categorize” my unlabeled images in order to load them into my CNN. For the labeled images I use datasets.ImageFolder() and DataLoader() so I can load them for training.
Thanks for the help!
PS: I thought about saving them in a separate folder named “Unlabeled”, but I am afraid it will use the folder name as a new class, which would ruin the predictions in training as well as in testing.

You are right that the ImageFolder would create a new class index for this Unlabeled folder, but you could also replace it with your desired “invalid class index” and remap the other class indices if needed.
This would allow you to use the plain ImageFolder dataset with a DataLoader without writing a custom Dataset.

but you could also replace it with your desired “invalid class index” and remap the other class indices if needed

Do you mean that I can change the name of the class inside the ImageFolder() call? Or is there another way to mark these images as unlabeled?

Yes, you could try to change the class indices in the internal ImageFolder .targets attribute, or, even better, you could remap the class indices in a target_transform.
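
For example, here is a minimal sketch of the target_transform approach. It assumes the unlabeled images sit in a subfolder called “Unlabeled” next to the class folders and uses -1 as the placeholder index; both are assumptions you would adapt to your setup.

from torchvision import datasets, transforms

# hypothetical root folder containing 5_1, 5_2, ..., plus an "Unlabeled" subfolder
dataset = datasets.ImageFolder("Train Dataset", transform=transforms.ToTensor())
unlabeled_idx = dataset.class_to_idx["Unlabeled"]  # index ImageFolder assigned to that folder

def remap_target(target):
    # send the Unlabeled folder to the placeholder index and shift the remaining classes down
    if target == unlabeled_idx:
        return -1
    return target - 1 if target > unlabeled_idx else target

dataset.target_transform = remap_target

This keeps the real classes at consecutive indices starting from 0, while every unlabeled sample comes back with the target -1.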

Hello again!
I still can’t resolve my problem.
I tried to create a new Dataset, but the main question remains. Even if I change the name of the class to an empty string, it still recognizes the classes and gives them indices. Is there a way to “delete” the classes and class indices, so that when I print, for example, a batch such as:

[tensor([[0.7176, 0.7176, 0.7176, …, 0.8863, 0.8863, 0.8863],
[0.7176, 0.7176, 0.7176, …, 0.8863, 0.8863, 0.8863], ]), tensor([3, 2])]

it gives me only the image tensor, without the second part tensor([3, 2])?
PS: the first tensor part is the image and the second is the class indices for the batch of two.

import glob

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, path):
        self.imgs_path = path
        file_list = glob.glob(self.imgs_path + "*")
        self.data = []
        for class_path in file_list:
            # placeholder "class name", since these images have no real label
            class_name = ' '
            for img_path in glob.glob(class_path + "/*.jpg"):
                self.data.append([img_path, class_name])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_path, class_name = self.data[idx]
        # cv2.imread returns a uint8 NumPy array, which has no .float() method;
        # cast it to float32 before creating the tensor
        img = cv2.imread(img_path).astype(np.float32)
        class_id = class_name
        img_tensor = torch.from_numpy(img)
        # HWC -> CHW
        img_tensor = img_tensor.permute(2, 0, 1)
        return img_tensor, class_id
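
Just as a usage sketch (the "Unlabeled/" path is an assumption, and the dataset expects the same folders-of-.jpg layout as above), this plugs into a regular DataLoader; note that the default collate function returns the ' ' placeholders as a plain list of strings.

from torch.utils.data import DataLoader

# hypothetical path; expects subfolders containing .jpg files, like the training layout
unlabeled_dataset = CustomDataset("Unlabeled/")
unlabeled_loader = DataLoader(unlabeled_dataset, batch_size=2, shuffle=True)

images, placeholders = next(iter(unlabeled_loader))
# images has shape [2, 3, H, W] only if all images share the same size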

No, there is no clean way of returning a valid target value for some samples and nothing for others, since this would create inconsistent batches of targets.
You could thus either ignore the unknown classes completely and remove them from the dataset, or return a “placeholder” target and deal with these samples during training.
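
If you go the placeholder route, here is a rough sketch of what that could look like in the training loop (the -1 placeholder, model, optimizer, and loader are assumptions, not something fixed by the dataset):

import torch.nn.functional as F

UNLABELED = -1  # assumed placeholder target for unlabeled samples

for images, targets in loader:          # loader yields mixed labeled/unlabeled batches
    outputs = model(images)             # hypothetical model
    labeled = targets != UNLABELED      # mask of samples that actually have a label

    if labeled.any():
        # supervised loss only on the labeled part of the batch
        # (nn.CrossEntropyLoss(ignore_index=UNLABELED) would achieve the same effect)
        loss = F.cross_entropy(outputs[labeled], targets[labeled])
        # ... combine with an unsupervised/consistency loss on outputs[~labeled] here ...
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()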