Sampling batches


I have an image-to-image translation task that uses two classes of images (landscapes and faces), and I want to randomize the batches in such a way that the classes are never mixed within a batch.

For example, landscape images should be used 4 out of 5 times and faces only 1 out of 5.

I haven’t found many references for a case like this. I read a comment that suggested using ConcatDataset to combine the two classes into a single dataset and then using ConcatDataset.cumulative_sizes to get the class boundaries.
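As a rough sketch of that idea: ConcatDataset.cumulative_sizes holds the running total of the dataset lengths, so each class occupies a contiguous index range that can be read off directly. The helper name and the sizes below (800 landscapes, 200 faces) are made up for illustration.

```python
def ranges_from_cumulative_sizes(cumulative_sizes):
    """Turn running totals such as [800, 1000] into one index range per class."""
    # Each class starts where the previous one ended; the first starts at 0.
    starts = [0] + list(cumulative_sizes[:-1])
    return [range(start, end) for start, end in zip(starts, cumulative_sizes)]


# Hypothetical sizes: ConcatDataset([landscapes, faces]).cumulative_sizes
# would be [800, 1000] for 800 landscapes followed by 200 faces.
landscape_range, face_range = ranges_from_cumulative_sizes([800, 1000])
print(landscape_range, face_range)
```

Any index drawn from the first range maps to a landscape and any index from the second to a face, as long as the datasets were concatenated in that order.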

I’m planning to use a custom sampler that randomly selects the index range of one class or the other for each batch. Is there a better way to do this? Any recommendations or pointers?

If the class distribution (4 landscape samples + 1 face sample) is not a hard requirement, you could use a WeightedRandomSampler and set the per-sample weights so that the desired distribution is reached in the “average batch”.
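One way to pick such weights, as a minimal sketch: give every sample of a class the weight (target class probability) / (class size), so that the weights of each class add up to its target share. The function name and the sizes are assumptions for illustration, and the samples are assumed to be ordered class by class, as in a ConcatDataset.

```python
def per_sample_weights(class_sizes, class_probs):
    """Return one weight per sample so that each class is drawn with its target probability."""
    weights = []
    for size, prob in zip(class_sizes, class_probs):
        # Spread the class probability evenly over all samples of that class.
        weights.extend([prob / size] * size)
    return weights


# Hypothetical sizes: 800 landscapes followed by 200 faces, target split 80% / 20%.
weights = per_sample_weights([800, 200], [0.8, 0.2])
```

These weights could then be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`. Note that this only controls how often each class is drawn overall; individual batches will still mix both classes.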
However, if the distribution is fixed, your approach seems like the best one, i.e. create a custom sampler that yields the precomputed indices, which the DataLoader then uses to fetch samples from the Dataset.
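A minimal sketch of such a sampler is shown below. It is not the code from this thread: the class name, its parameters and the sizes are assumptions, and it picks the class of each batch at random with the given probabilities rather than following a strict 4-then-1 schedule. It is written as a plain Python iterable that yields one list of indices per batch, which is the form a DataLoader accepts for its `batch_sampler` argument.

```python
import random


class SingleClassBatchSampler:
    """Yield batches of indices where every batch comes from exactly one class."""

    def __init__(self, cumulative_sizes, class_probs, batch_size, num_batches, seed=None):
        # cumulative_sizes is assumed to look like ConcatDataset.cumulative_sizes.
        starts = [0] + list(cumulative_sizes[:-1])
        self.ranges = [range(s, e) for s, e in zip(starts, cumulative_sizes)]
        self.class_probs = list(class_probs)
        self.batch_size = batch_size      # must not exceed the smallest class size
        self.num_batches = num_batches    # batches per epoch (hypothetical parameter)
        self.rng = random.Random(seed)

    def __len__(self):
        return self.num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            # Choose the class for this batch, e.g. landscapes with probability 0.8.
            (class_range,) = self.rng.choices(self.ranges, weights=self.class_probs, k=1)
            # Draw distinct indices from that class only; sampling from a range
            # avoids building explicit index lists.
            yield self.rng.sample(class_range, self.batch_size)


# Hypothetical setup: 800 landscapes followed by 200 faces.
sampler = SingleClassBatchSampler([800, 1000], [0.8, 0.2], batch_size=16, num_batches=50, seed=0)
batches = list(sampler)
```

It would be used as `DataLoader(concat_dataset, batch_sampler=sampler)`, leaving `batch_size`, `shuffle`, `sampler` and `drop_last` at their defaults, since those options cannot be combined with `batch_sampler`.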

Thanks for your feedback! I wrote the sampler and it works as expected, but it needs optimization: right now I build the indices with Python loops over lists, which is very slow. I’ll take care of that later :smiley:

WeightedRandomSampler would work if I wanted to oversample and have both classes mixed in the same batch, but in this case I want to keep the classes separate. Sadly, I didn’t find a way to make it do what I needed.