How to create a dynamic Dataset

Hi, I’m doing active learning for my segmentation project. I have an “IMG folder” containing about 3000 images. My task is: first, load 500 images from that IMG folder to train a model X; second, use some constraint (defined by an is_collected() function) to collect the next data (also from that IMG folder) for the next training phase, and so on…
What should I do about loading the Dataset?

Do you mean a batch size of 500? The fundamental idea is a generator: on each loop you increment an index, say “i”, and continue from there (the next 500) when you call next.

No, I mean: I want to use 500 images for the first training run (the batch size can be anything). After that, I use some constraint (e.g. entropy loss of the image < some alpha) to collect new data that is still in the original pool dataset. Do I have to create a new DataLoader? How can I do this?

Well, I think you can maintain some kind of “don’t load this image anymore” list, containing the names of images whose cross_entropy < alpha after training.

Consider an example :

import os
import torch
from PIL import Image

class RestrictedDataset(torch.utils.data.Dataset):
    def __init__(self, img_directory, banned_images):
        self.img_directory = img_directory
        # keep only the images that have not been banned yet
        self.img_paths = [image_name for image_name in os.listdir(img_directory)
                          if image_name not in banned_images]
    def __getitem__(self, index):
        image_name = self.img_paths[index]
        # load the image however your pipeline requires, e.g. with PIL
        image = Image.open(os.path.join(self.img_directory, image_name))
        return image, image_name
    def __len__(self):
        return len(self.img_paths)

And then during training you can simply discard image_name (there is no use for it during training).

But when training is done and you want to remove the images that pass the threshold, you’ll have image_name at hand to add to the banned_images list. Hope you get the idea.
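The selection step after training could be sketched like this (the per-image losses, the alpha threshold, and the function name are all hypothetical, just to illustrate the idea of growing the banned list):

```python
def update_banned(per_image_loss, banned_images, alpha=0.5):
    """Add names whose loss already fell below alpha to the banned set.

    per_image_loss: dict mapping image_name -> loss measured after training.
    """
    for name, loss in per_image_loss.items():
        if loss < alpha:
            banned_images.add(name)
    return banned_images

# Example with made-up losses: only "a.png" passes the threshold.
banned = update_banned({"a.png": 0.1, "b.png": 0.9}, set(), alpha=0.5)
```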

P.S. You will have to recreate the dataset and DataLoader after each training cycle, but I don’t think that will add any noticeable overhead in your solution.
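A minimal sketch of that recreate-per-cycle loop, using a toy in-memory dataset in place of the real image-loading one (the names, the number of cycles, and the "pretend selection" are made up for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NamedDataset(Dataset):
    # toy stand-in for RestrictedDataset: items are (tensor, name)
    def __init__(self, names, banned):
        self.names = [n for n in names if n not in banned]
    def __getitem__(self, i):
        return torch.zeros(3), self.names[i]
    def __len__(self):
        return len(self.names)

all_names = [f"img_{i}.png" for i in range(6)]
banned = set()
for cycle in range(2):
    # rebuild dataset and DataLoader each cycle so banned images drop out
    dataset = NamedDataset(all_names, banned)
    loader = DataLoader(dataset, batch_size=2)
    for images, names in loader:
        pass  # training step goes here; `names` is ignored while training
    banned.add(all_names[cycle])  # pretend selection step bans one image
```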


Yeah, thank you @dazzle-me! I think the same as you. Moreover, I think I should create a JSON file that contains all the banned images, and update this JSON file after every training phase.
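That JSON bookkeeping could look roughly like this (the filename and the example image names are hypothetical):

```python
import json
import os

BANNED_FILE = "banned_images.json"  # hypothetical filename

def load_banned(path=BANNED_FILE):
    """Read the banned-image list; return an empty set on the first run."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def save_banned(banned, path=BANNED_FILE):
    """Persist the banned-image names as a JSON list."""
    with open(path, "w") as f:
        json.dump(sorted(banned), f)

# After a training phase: merge the newly banned names and write back.
banned = load_banned()
banned |= {"img_001.png", "img_042.png"}
save_banned(banned)
```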