How to train with imbalanced classes for classification?

John1231983 · August 20, 2018, 1:35am

Hello all, I have read some related questions about imbalanced classification, but I did not find an answer. My dataset has imbalanced classes, with the distribution shown below.

I am using WeightedRandomSampler to handle this problem. First, train_labels holds the labels of the training set; it looks like train_labels=[0, 2, 1, 2, 4, 2, 4, 3, 5...]

class_sample_counts=np.unique(train_labels, return_counts=True)[1]
weights = (1 / torch.Tensor(class_prob))
weighted_sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(train_labels))
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, sampler=weighted_sampler)

The code works, but with ImageNet pretraining the top-1 accuracy on training and validation reaches 100% very quickly, while the testing accuracy is very bad (30%). If I do not use the sampler, the testing accuracy is 70% and the training and validation accuracy grow slowly. What is happening in my code? Thanks

ptrblck · August 20, 2018, 8:55am

I think your issue might be related to this one.
Try to pass the weights for each sample to WeightedRandomSampler and see if it’s working.
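
A minimal sketch of what per-sample weights could look like, assuming integer labels starting at 0 with every class present at least once. WeightedRandomSampler draws indices in range(len(weights)), so a weights tensor with one entry per class would only ever select the first few samples of the dataset, which may explain the very fast rise in training accuracy. The label array and dummy features below are made up for illustration; only the names from the question are reused.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced labels standing in for train_labels.
train_labels = np.array([0] * 90 + [1] * 9 + [2] * 1)

# One count per class, then one inverse-frequency weight per class.
class_sample_counts = np.bincount(train_labels)
class_weights = 1.0 / class_sample_counts

# Index the class weights by label to get one weight per sample.
sample_weights = torch.from_numpy(class_weights[train_labels])

weighted_sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)

# Dummy features so the sketch runs on its own.
train_dataset = TensorDataset(
    torch.randn(len(train_labels), 4),
    torch.from_numpy(train_labels),
)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=weighted_sampler)

# Collect the labels seen in one epoch to inspect the class balance.
drawn_labels = torch.cat([targets for _, targets in train_loader])
print(torch.bincount(drawn_labels, minlength=3))
```

With these weights each class carries the same total sampling mass, so the printed counts should be roughly equal across classes rather than 90/9/1.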

John1231983 · August 20, 2018, 11:27am

Thanks for your reply. I have looked at the code. Is train_targets the same as train_labels in my question?

ptrblck · August 20, 2018, 11:29am

Yes, should be the same.

John1231983 · August 25, 2018, 4:06am

I was able to run it. However, it only supports 1 worker, while without sampling I can use up to 10 workers. Is there any way to speed up training? Without the sampler, my code seems to run 5 times faster.

sinAshish · March 4, 2021, 4:46pm

What should be set as num_samples in the WeightedRandomSampler?
Say that I have 10 classes and I’ve initialized the weights as the inverse of the count of each class.
I know that I can’t set it as batch_size.

ptrblck · March 5, 2021, 5:39am

The weights should be assigned to each sample, not only the inverse class count as described in the previous post.
Usually num_samples is then set to the length of the Dataset, i.e. the number of samples.
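
As a rough illustration of how num_samples behaves, here is a small sketch with a made-up 10-sample label tensor. num_samples sets how many indices the sampler yields per epoch, independent of batch_size, while the weights tensor has one entry per sample.

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical labels for a small imbalanced dataset of 10 samples.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 2])

# Inverse class count, looked up per sample: one weight for each sample.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].double()

# num_samples = dataset length: one "epoch" yields as many indices as samples.
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
indices = list(sampler)

# A larger num_samples simply makes each epoch longer.
long_sampler = WeightedRandomSampler(sample_weights, num_samples=3 * len(labels), replacement=True)

print(len(indices), len(long_sampler))
```

Every drawn index refers to a position in the dataset, so with replacement=True minority samples may appear several times within one epoch.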