Imbalanced sampling from WeightedRandomSampler even when all weights are equal

I’m using WeightedRandomSampler on a long-tailed dataset to get more balanced samples. It works okay, but quite a few samples are never selected. So I gave it less imbalanced weights to cover more of the dataset, yet many samples are still never selected. Then I tried giving every sample the same weight, and the sampling is still uneven.
For example,

import collections
import torch

# 100 equal weights, draw 100 samples with replacement
samples = list(torch.utils.data.WeightedRandomSampler([0.01 for _ in range(100)], 100, replacement=True))
samples_counter = collections.Counter(samples)

And len(samples_counter) is usually around 66 (about 2/3 of 100), which means only about two thirds of the samples are ever drawn, even though they all share the same weight.
Changing the number of samples, I still get a similar result: only about 2/3 of the samples are chosen. But I expected roughly all samples to be selected, since they are given the same weights. How can I cover more of the dataset with WeightedRandomSampler?
Thanks for any help.

This is an effect of the statistics of sampling 100 values with replacement from 1…100: each draw misses a given index with probability 99/100, so the expected number of distinct indices is 100 · (1 − (99/100)^100) ≈ 63, i.e. roughly 1 − 1/e of the dataset. If you insist on WeightedRandomSampler, you could sample more values per epoch or just run it for more epochs; the coverage gap shrinks quickly as the number of draws grows.
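A quick sketch to check this empirically (the variable names here are just for illustration): coverage with 100 draws is about 63 out of 100, and drawing more samples per epoch closes the gap.

import collections
import torch

n = 100
weights = [0.01] * n

# Expected number of distinct indices after k draws with replacement is
# n * (1 - (1 - 1/n)**k): about 0.63*n for k = n, approaching n as k grows.
for k in (n, 3 * n, 10 * n):
    samples = list(torch.utils.data.WeightedRandomSampler(weights, k, replacement=True))
    covered = len(collections.Counter(samples))
    expected = n * (1 - (1 - 1 / n) ** k)
    print(f"k={k}: covered {covered}/{n}, expected ~{expected:.1f}")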
My go-to solution is to duplicate examples from the rarer classes at the Dataset level (a sketch is below). This requires some care when using BatchNorm and the like, but usually works quite well.
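A minimal sketch of that idea, assuming a map-style dataset and an externally supplied list of integer class labels (OversampledDataset and the labels argument are illustrative, not part of any library API):

import collections
from torch.utils.data import Dataset

class OversampledDataset(Dataset):
    """Wraps a map-style dataset and repeats indices of rarer classes
    so that every class contributes roughly the same number of examples."""

    def __init__(self, base_dataset, labels):
        # labels: class label of each example in base_dataset, same order
        self.base = base_dataset
        counts = collections.Counter(labels)
        largest = max(counts.values())
        self.indices = []
        for idx, label in enumerate(labels):
            # repeat each example roughly (largest / class count) times
            repeats = round(largest / counts[label])
            self.indices.extend([idx] * repeats)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        return self.base[self.indices[i]]

You can then pass the wrapped dataset to a plain DataLoader with shuffle=True and drop the sampler entirely.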

Best regards

Thomas

I see. Thanks for your answer!