Use WeightedRandomSampler for Imbalanced Classes

Hi,
I am trying to use WeightedRandomSampler in this way:

class_sample_count = [39736, 949, 7807]
weights = 1 / torch.Tensor(class_sample_count)
weights = weights.double()
sampler = torch.utils.data.sampler.WeightedRandomSampler(
    weights=weights,
    num_samples=?,
    replacement=False)
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], drop_last=True, sampler=sampler, batch_size=32) for x in ['train', 'valid']}

I want each minority class sample to appear at least once… what num_samples should I use? Also, am I using it the right way? I am seeing all the samples from only the minority class. Thank you in advance!


The weights tensor should contain a weight for each sample, not the class weights.
Have a look at this post for an example.
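Roughly, the idea is to expand the per-class weights into one weight per sample by indexing with the targets. A minimal sketch along those lines, assuming the datasets above are ImageFolder-style datasets that expose their labels via .targets (adjust for your own dataset):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# per-sample class labels; for an ImageFolder dataset these are stored in .targets
targets = torch.tensor(image_datasets['train'].targets)

# count the samples per class and invert the counts to get per-class weights
class_sample_count = torch.bincount(targets)
class_weights = 1.0 / class_sample_count.double()

# index the per-class weights with the targets -> one weight per sample
sample_weights = class_weights[targets]

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)

train_loader = DataLoader(image_datasets['train'], batch_size=32,
                          sampler=sampler, drop_last=True)

With replacement=True and num_samples equal to the dataset length, each batch should contain roughly the same number of samples per class.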


Okay, I understand.
Thanks again @ptrblck… your sample code in the link helped a lot!


@ptrblck - I'm using WeightedRandomSampler for an imbalanced class problem and have a question about the replacement parameter. Does passing replacement=False ensure that there are no repeated samples within a batch? All my classes have more samples than the batch size.

replacement=False will not draw the same sample twice during the entire epoch, so while the first batches might have a balanced class distribution, the later ones will yield more samples of the majority classes.
Thus, I don’t think using replacement=False is a proper way to balance the data batches.
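A small self-contained sketch that makes this visible (toy data and hypothetical class counts, chosen only for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# toy imbalanced dataset: 100 samples of class 0 and 900 samples of class 1
targets = torch.cat([torch.zeros(100, dtype=torch.long),
                     torch.ones(900, dtype=torch.long)])
dataset = TensorDataset(torch.randn(len(targets), 4), targets)

# inverse-frequency weight for every sample
class_weights = 1.0 / torch.bincount(targets).double()
sample_weights = class_weights[targets]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(targets),
                                replacement=False)
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

for i, (x, y) in enumerate(loader):
    # early batches are roughly balanced; once the 100 class-0 samples are
    # used up, the remaining batches contain only class 1
    print(f"batch {i}: class counts {torch.bincount(y, minlength=2).tolist()}")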

@ptrblck Isn’t it important to state that with replacement=False, the sampler doesn’t consider weights as one might expect?

For example, I have a dataset with 1000 samples: 300 cats and 700 dogs. I assign weights as [1/300, 1/700] and set num_samples to twice the size of the minority class (300*2=600). Using replacement=False, I would expect to get a balanced subset each epoch containing approximately 300 cats and 300 dogs. However, I don’t actually get a balanced subset.

While I can set replacement=True, it leads to duplicating samples within an epoch, which I want to avoid.

Using replacement=False allows you to undersample the dataset (you would need to stop iterating the dataset in this case), since the weights will still be used but the sampler will be forced to pick imbalanced samples towards the end of the dataset. Without duplicating samples it is impossible to create the same number of balanced batches as the full dataset would yield.
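In code that could look like the sketch below, reusing the sample_weights, dataset, and targets names from the toy example above; the batch size and the 2 * minority-count cutoff are arbitrary choices:

# draw without replacement, but only consume the first, still balanced batches
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights),
                                replacement=False)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

num_minor = int(torch.bincount(targets).min())   # size of the smallest class
max_batches = (2 * num_minor) // 20              # stop before the majority dominates

for i, (x, y) in enumerate(loader):
    if i >= max_batches:
        break
    # ... training step on the (roughly balanced) batch ...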

After reviewing the source code, I understand why the imbalance occurs at the end of the dataset. This is due to the behavior of torch.multinomial.
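The effect can be reproduced directly with torch.multinomial, e.g. with a few made-up weights (3 high-weight and 9 low-weight entries):

import torch

torch.manual_seed(0)
# 3 "minority" entries with high weight, 9 "majority" entries with low weight
weights = torch.tensor([1/3] * 3 + [1/9] * 9, dtype=torch.double)

# draw every index exactly once, without replacement
order = torch.multinomial(weights, num_samples=len(weights), replacement=False)
print(order)
# the high-weight indices 0-2 tend to be drawn early, so the tail of the
# permutation consists almost entirely of the low-weight (majority) indices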

Yes, that’s why I’m setting num_samples=2*num_minor_class. This theoretically allows for sampling balanced classes without duplicates. Initially, I thought this would happen based on the sampler’s documentation.

I recommend adding a note near the weights argument indicating that weights may not be used as expected when replacement=False.

I believe they are used as expected. However, when no replacement is allowed, an already drawn sample cannot be picked again. Higher weights will still indicate a higher likelihood of being selected.