Dataset/Dataloader minibatch class distribution

I am trying to find out if there is any way to force the distribution of classes in each batch that is produced when using the PyTorch Dataset and DataLoader functionality. For example, I am doing binary classification and, because my class sizes are imbalanced, I would like each batch during training to be 50% positive examples and 50% negative. Is there any way to achieve this with DataLoader?
Thanks!

I think the proper way to get such functionality is to subclass torch.utils.data.Sampler with a custom sampler that draws equiprobably from the positive and negative examples. You then pass an instance of this custom sampler to the DataLoader as the sampler argument.

Either way, you will need to decide whether to repeat the smaller class (so that you visit every element of the larger class once per epoch) or to trim the larger class (so that you visit every element of the smaller class once per epoch). In the first case, data augmentation is recommended.
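To make this concrete, here is a minimal sketch of such a custom sampler (the class name, the `labels` argument, and the toy dataset are illustrative assumptions, not part of any PyTorch API). It follows the "repeat the smaller class" option: the larger class is visited once per epoch while the smaller class is cycled.

```python
import random
import torch
from torch.utils.data import Sampler, DataLoader, TensorDataset


class BalancedSampler(Sampler):
    """Yields indices alternating positive/negative so each batch is ~50/50.

    Sketch only: the larger class is visited once per epoch; the smaller
    class is repeated (cycled) to keep up, so augmentation is useful.
    """

    def __init__(self, labels):
        self.pos = [i for i, y in enumerate(labels) if y == 1]
        self.neg = [i for i, y in enumerate(labels) if y == 0]
        self.longest = max(len(self.pos), len(self.neg))

    def __iter__(self):
        # Fresh shuffle of each class every epoch.
        pos = random.sample(self.pos, len(self.pos))
        neg = random.sample(self.neg, len(self.neg))
        for k in range(self.longest):
            # Walk through the larger class once; wrap around on the smaller one.
            yield pos[k % len(pos)]
            yield neg[k % len(neg)]

    def __len__(self):
        return 2 * self.longest


# Hypothetical toy data: 10 positives, 90 negatives, 5 features each.
labels = [1] * 10 + [0] * 90
dataset = TensorDataset(torch.randn(100, 5), torch.tensor(labels))

loader = DataLoader(dataset, batch_size=8, sampler=BalancedSampler(labels))
for x, y in loader:
    pass  # with an even batch size, each batch is exactly 50% of each class
```

Because the sampler yields indices in strict positive/negative alternation, any even batch size gives an exact 50/50 split per batch rather than just 50/50 in expectation.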

Thank you, that’s exactly what I was looking for :grin:

Personally, I use the WeightedRandomSampler class (https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler) and give each sample a weight equal to the inverse of its class frequency, which results in roughly equal representation of the classes in each mini-batch (assuming a large enough batch size, I guess).
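For reference, a minimal sketch of that approach, assuming made-up labels and data purely for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical imbalanced labels: 100 positives, 900 negatives.
labels = torch.cat([torch.ones(100), torch.zeros(900)]).long()
dataset = TensorDataset(torch.randn(1000, 5), labels)

# Weight each sample by the inverse frequency of its class.
class_counts = torch.bincount(labels)              # tensor([900, 100])
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for x, y in loader:
    # On average, each batch is roughly 50% ones and 50% zeros.
    pass
```

Note that the balance here is only in expectation: each batch is drawn with replacement according to the weights, so small batches can still deviate from an exact 50/50 split.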

I want to clarify one point regarding the WeightedRandomSampler.
While it oversamples the minority class, it also undersamples the majority class.
Let's say I have 100 images of class A and 900 images of class B.
Then the DataLoader length will be 1000, and when we iterate over mini-batches it will ensure an equal distribution, so roughly 500 images of class A and 500 images of class B will be used for training.
Can't we say it is oversampling the minority class but undersampling the majority class in the dataset?

Answered here.