Stratified dataloader for imbalanced data

jasperhyp · June 22, 2022, 4:44pm

Hi,

I reviewed previous posts on this topic and found that most answers aim to build balanced batches rather than keep the original class distribution, e.g. this one and that one. That may not be good practice, because oversampling an inherently imbalanced distribution can introduce bias. Is there functionality similar to the “stratify” option in sklearn.model_selection.train_test_split, where the original class distribution is preserved in each stratum, i.e. in each batch?

Please feel free to comment on this argument & question!

nivek · June 22, 2022, 6:41pm

If I understand correctly, you want to preserve the original class distribution. If each batch only needs to preserve it approximately, RandomSampler should help (or random_split, to split a dataset into subsets). Because both are random, the distribution will not be exact across batches or datasets, but it should be close.
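
To get a feel for how close "approximately" is, here is a rough simulation in plain Python. It only mimics what a shuffling sampler does (a random permutation of indices cut into batches) and does not call PyTorch itself. The function name `batch_class_fractions` and the 10% minority toy labels are made up for illustration.

```python
import random


def batch_class_fractions(labels, batch_size, minority, seed=0):
    """Shuffle indices like a random sampler and return the minority
    fraction observed in each consecutive batch."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    fractions = []
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        hits = sum(labels[i] == minority for i in batch)
        fractions.append(hits / len(batch))
    return fractions


# Hypothetical toy data: 100 minority (1) and 900 majority (0) samples.
labels = [1] * 100 + [0] * 900
fractions = batch_class_fractions(labels, batch_size=50, minority=1)
print("min / max minority fraction per batch:", min(fractions), max(fractions))
```

The average over all batches matches the dataset's 10%, but individual batches scatter around it, and with a rare class or small batches some batches may contain no minority sample at all.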

If you want each batch to have the exact distribution, consider writing your own Sampler or using the sklearn function.
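
As a minimal sketch of the "write your own Sampler" route: the class below is hypothetical (the name `StratifiedBatchSampler` is not a PyTorch class) and assumes the labels are available as a plain list. It deals each class's shuffled indices across the batches in turn, so per-batch class counts differ by at most one between batches. It is written in plain Python and yields lists of indices, which is the form DataLoader's `batch_sampler` argument expects.

```python
import math
import random
from collections import defaultdict


class StratifiedBatchSampler:
    """Yield index batches whose class counts mirror the whole dataset.

    Hypothetical helper for illustration, not part of PyTorch.
    """

    def __init__(self, labels, batch_size, seed=0):
        self.by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_class[label].append(idx)
        self.n_batches = math.ceil(len(labels) / batch_size)
        self.rng = random.Random(seed)

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        batches = [[] for _ in range(self.n_batches)]
        offset = 0  # continue dealing where the previous class stopped
        for indices in self.by_class.values():
            shuffled = indices[:]
            self.rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                batches[(offset + pos) % self.n_batches].append(idx)
            offset += len(shuffled)
        for batch in batches:
            self.rng.shuffle(batch)  # avoid a fixed class order inside a batch
            yield batch


# Assumed usage with an existing `dataset` and its `labels` list:
# loader = torch.utils.data.DataLoader(
#     dataset, batch_sampler=StratifiedBatchSampler(labels, batch_size=50))
```

Because the sampler already produces whole batches, `batch_size`, `shuffle`, `sampler`, and `drop_last` should be left at their defaults when passing it as `batch_sampler`.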

Let me know if that is helpful. If not, you can provide some minimal code snippets (or sample data distribution) that you are working with, and we can have a look.

jasperhyp · June 22, 2022, 6:50pm

Thanks! I think what I want is just to ensure that each batch is sampled randomly and contains at least 1 minority sample. I will write a simple class to do this.
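
One possible shape for such a class, as a rough sketch rather than a finished implementation: reserve one randomly chosen minority index per batch, then fill the remaining slots from a shuffle of everything else. The class name `MinorityGuaranteedBatchSampler` and its parameters are made up here, and it assumes a single minority label and at least as many minority samples as batches.

```python
import math
import random


class MinorityGuaranteedBatchSampler:
    """Random batches that each contain at least one minority-class index.

    Hypothetical helper for illustration, not a PyTorch class.
    """

    def __init__(self, labels, batch_size, minority_label, seed=0):
        self.n_samples = len(labels)
        self.batch_size = batch_size
        self.minority = [i for i, y in enumerate(labels) if y == minority_label]
        self.n_batches = math.ceil(self.n_samples / batch_size)
        if len(self.minority) < self.n_batches:
            raise ValueError("need at least one minority sample per batch")
        self.rng = random.Random(seed)

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        # Reserve one randomly chosen minority index for every batch.
        reserved = self.rng.sample(self.minority, self.n_batches)
        reserved_set = set(reserved)
        # All other indices (both classes) are shuffled together.
        rest = [i for i in range(self.n_samples) if i not in reserved_set]
        self.rng.shuffle(rest)
        fill = self.batch_size - 1
        for b, reserved_idx in enumerate(reserved):
            batch = [reserved_idx] + rest[b * fill:(b + 1) * fill]
            self.rng.shuffle(batch)
            yield batch


# Assumed usage with an existing `dataset` and its `labels` list:
# loader = torch.utils.data.DataLoader(
#     dataset,
#     batch_sampler=MinorityGuaranteedBatchSampler(labels, 50, minority_label=1))
```

Every sample is still visited exactly once per epoch, so the epoch-level class distribution is unchanged; only the placement of minority samples across batches is constrained.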