Effective way of weighted sampling from different folders

Hi,

I’m trying to train a model with custom datasets.

I need to select a folder to be sampled from according to probability.

For example, an image should be drawn from folder A with 70% probability and from folder B with 30% probability.

This is the code I wrote, but I don't think it's very efficient:

import os
from glob import glob

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, img_dir, img_dir2, ratio):
        # ratio = probability of drawing from the first folder
        self.ratio = ratio
        self.img_list = sorted(glob(os.path.join(img_dir, '**/Image_*'), recursive=True))
        self.img_list2 = sorted(glob(os.path.join(img_dir2, '**/Image_*'), recursive=True))

    def __getitem__(self, index):
        # The passed index is ignored; a folder is picked at random on every call
        if np.random.random() < self.ratio:
            index = np.random.randint(len(self.img_list))
            img = Image.open(self.img_list[index])
        else:
            index = np.random.randint(len(self.img_list2))
            img = Image.open(self.img_list2[index])

        img = self.preprocess(img)
        return torch.from_numpy(img).type(torch.FloatTensor)

    def __len__(self):
        return len(self.img_list) + len(self.img_list2)

Is there a more effective way to perform this task?

I guess I need to use a sampler, but I can't figure out how to use it.

Thanks in advance :stuck_out_tongue:

You could use a WeightedRandomSampler and assign a weight to each sample based on the folder it comes from. This has the advantage that the weights are computed only once, and the sampler then draws from the two folders with the desired probabilities.
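
Here is a minimal sketch of that idea. It assumes the two folders are simply concatenated into one dataset that indexes its files deterministically; FolderDataset, the placeholder paths, and the fixed Resize size are just illustrations, so plug in your own preprocessing instead.

import os
from glob import glob

import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler
from torchvision import transforms


class FolderDataset(Dataset):
    # Concatenates the file lists of both folders and indexes them deterministically,
    # so the sampler decides which folder a sample comes from.
    def __init__(self, img_dir, img_dir2):
        self.img_list = sorted(glob(os.path.join(img_dir, '**/Image_*'), recursive=True))
        self.img_list2 = sorted(glob(os.path.join(img_dir2, '**/Image_*'), recursive=True))
        self.files = self.img_list + self.img_list2
        # Stand-in preprocessing; replace with your own preprocess()
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __getitem__(self, index):
        img = Image.open(self.files[index]).convert('RGB')
        return self.transform(img)

    def __len__(self):
        return len(self.files)


dataset = FolderDataset('path/to/A', 'path/to/B')

# Per-sample weights: all images of folder A together get 70% of the probability
# mass, all images of folder B get the remaining 30%.
ratio = 0.7
weights = torch.cat([
    torch.full((len(dataset.img_list),), ratio / len(dataset.img_list)),
    torch.full((len(dataset.img_list2),), (1. - ratio) / len(dataset.img_list2)),
])

sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

Since the sampler draws indices independently with replacement, roughly 70% of them will fall into the first part of dataset.files; you can check this by counting how many sampled indices are smaller than len(dataset.img_list).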