Is it okay to match the ratio of labels in the mini-batch when the data is extremely imbalanced?

Hi, thank you for your work on PyTorch.
I have a general question about the sampling strategy in the DataLoader.

When I have a very imbalanced dataset with binary classes for a link prediction task, the default sampler returns randomly sampled cases that preserve the original class ratio.
(In detail, I have 138,015 edges with label 0 and 8,791 with label 1.)
Before training the model, I checked the ratio of labels in the mini-batches and found that most labels were 0 (i.e., [0,0,0,0,0,0,0,0] when batch_size=8).

Although I know there are several ways to address this problem (e.g., WeightedRandomSampler in torch.utils.data, RandomUnderSampler in imblearn.under_sampling, …), I am considering matching the ratio of pos (1) and neg (0) labels in each batch by randomly choosing samples from the majority class (i.e., [1,1,1,1,0,0,0,0] when batch_size=8).
But is this the right way to overcome the data imbalance?
Since negative sampling on a graph generally means sampling edges from the unconnected node pairs, I'm not sure how to handle the minority class: is it okay to repeat its samples across iterations? (A rough sketch of what I have in mind is below.)
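Something like this minimal sketch is what I mean (the label tensor here is just a stand-in built from the counts above):

```python
import torch

# Stand-in labels matching my counts: 138,015 negatives and 8,791 positives.
labels = torch.cat([torch.zeros(138_015), torch.ones(8_791)]).long()
pos_idx = (labels == 1).nonzero(as_tuple=True)[0]
neg_idx = (labels == 0).nonzero(as_tuple=True)[0]

def balanced_batch(pos_idx, neg_idx, batch_size=8):
    """Draw half of the batch from each class; the small positive class is
    sampled with replacement, so its edges repeat across batches."""
    half = batch_size // 2
    pos = pos_idx[torch.randint(len(pos_idx), (half,))]   # minority, may repeat
    neg = neg_idx[torch.randperm(len(neg_idx))[:half]]    # majority, no repeats
    return torch.cat([pos, neg])

batch = balanced_batch(pos_idx, neg_idx)   # labels[batch] -> [1, 1, 1, 1, 0, 0, 0, 0]
```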

Thank you for reading my question.

Hi Songyeon!

Yes, forcing each batch to contain the same number of negative samples as
positive samples is a perfectly legitimate way to sample data when you have
unbalanced binary classes.

(You could also reweight the samples in your loss function, for example,
by using BCEWithLogitsLoss’s pos_weight constructor argument. It is,
however, my theoretical preference to sample the minority class more
heavily, as you propose doing, rather than reweight the samples in the
loss function, unless the number of minority samples in your training set is
so small that batches would often contain duplicates of the same sample.)
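For reference, here is a minimal sketch of the pos_weight approach (the counts come from your post; the logits and labels are just placeholders):

```python
import torch
import torch.nn as nn

num_neg, num_pos = 138_015, 8_791    # edge counts from the original post

# pos_weight multiplies the loss of positive samples so that, in expectation,
# the positive and negative classes contribute equally to the total loss.
pos_weight = torch.tensor([num_neg / num_pos])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8)                                    # placeholder model outputs
labels = torch.tensor([0., 0., 0., 0., 0., 0., 0., 1.])    # placeholder targets
loss = criterion(logits, labels)
```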

It’s worth noting that if you use something like WeightedRandomSampler
so that your batches of eight samples contain on average four positive and
four negative samples, it is unlikely that a given batch would contain no
positive or no negative samples and it is rather unlikely that a given batch
would contain only one sample from a class. So you would still get batches
that are usually reasonably well balanced even if you don’t force each batch
to have exactly four positive and four negative samples.
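As an illustration, WeightedRandomSampler can be set up as follows (the dataset here is a small made-up one; the per-sample weight computation is the part that carries over):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Small made-up dataset with a similar imbalance (not your real edges).
labels = torch.cat([torch.zeros(1000), torch.ones(64)]).long()
features = torch.randn(len(labels), 16)
dataset = TensorDataset(features, labels)

# Weight each sample inversely to its class frequency so that, on average,
# both classes are drawn with equal probability.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),   # number of draws per epoch
    replacement=True,          # minority samples repeat across batches
)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for _, y in loader:
    print(y.tolist())          # roughly four 1s and four 0s per batch on average
    break
```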

Best.

K. Frank


Thank you for your kind and prompt reply!
It’s very helpful for my task. :+1: