How to balance data in PyTorch DataLoader

SarahTeoh · January 16, 2021, 12:59pm

Hi, my training data is currently imbalanced.
Distribution of the training data:

I want to adjust the data so that every value range has at least 50 samples.
For example, 0 to 0.25 has 50 samples, 0.25 to 0.5 has 50 samples, and so on.

How can I do that?
I know the PyTorch DataLoader has a BatchSampler that can be used to sample an equal number of samples from each class, but that approach relies on class labels, and my targets are continuous values rather than class labels.

ptrblck · February 1, 2021, 8:38am

You could use a WeightedRandomSampler and try to adapt this example for your use case.
Based on the figure, it seems you are working on a regression task, so you would need to bin the targets first and then calculate the “class weights” (which would be bin weights in your case) from the bin counts.
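A minimal sketch of that idea, not the linked example itself: it assumes targets in [0, 1] and four equal-width bins, matching the ranges in the question. The data is synthetic, and names such as `features`, `targets`, `num_bins` and `bin_ids` are placeholders for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced regression data: targets in [0, 1], skewed towards 0.
features = torch.randn(1000, 8)
targets = torch.rand(1000) ** 3

# 1. Assign each target to a bin. Assumed bins: four equal-width ranges
#    [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1.0].
num_bins = 4
boundaries = torch.linspace(0.0, 1.0, num_bins + 1)[1:-1]  # inner edges 0.25, 0.5, 0.75
bin_ids = torch.bucketize(targets, boundaries, right=True)

# 2. Count the samples per bin and weight each sample by the inverse
#    of its bin's count, so every bin carries the same total weight.
bin_counts = torch.bincount(bin_ids, minlength=num_bins).float()
bin_weights = 1.0 / bin_counts.clamp(min=1)  # clamp avoids division by zero for empty bins
sample_weights = bin_weights[bin_ids]

# 3. Draw indices with replacement according to the per-sample weights.
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
loader = DataLoader(TensorDataset(features, targets), batch_size=50, sampler=sampler)

# Check the balance over one epoch of sampled targets.
sampled_targets = torch.cat([y for _, y in loader])
sampled_counts = torch.bincount(
    torch.bucketize(sampled_targets, boundaries, right=True), minlength=num_bins
)
print("original counts per bin:", bin_counts.long().tolist())
print("sampled counts per bin: ", sampled_counts.tolist())
```

Note that this balances the bins in expectation by drawing rare samples more often (with replacement); it does not guarantee an exact minimum such as 50 per bin, and it does not create new data points.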