I have an imbalanced dataset with 3 classes. The majority class is 10 times the size of each of the 2 minority classes; say the class sizes are [10000, 1000, 1000]. If I use the WeightedRandomSampler, samples from the minority classes will be drawn with a higher probability than samples from the majority class, which means a batch from the loader will contain roughly the same number of samples from each class. Does that mean that my dataloader will run more iterations than it would on the imbalanced dataset? Suppose I have a batch size of 100: the imbalanced dataset gives 120 iterations ((10000+1000+1000) / 100), and with the WeightedRandomSampler there would be 300 iterations ((10000+10000+10000) / 100). Is that correct?
Not necessarily, as you have to provide the number of samples to draw using the num_samples argument. E.g. if you specify to draw 12000 samples with replacement=True, some samples of your majority class will probably not be drawn.
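To make the iteration count concrete, here is a minimal sketch using a toy TensorDataset as a stand-in for the dataset in the question (the dummy features and variable names are assumptions for illustration only). The number of batches follows from num_samples, not from the class balance:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for the dataset in the question: class sizes [10000, 1000, 1000].
class_sizes = [10000, 1000, 1000]
targets = torch.cat([torch.full((n,), c, dtype=torch.long)
                     for c, n in enumerate(class_sizes)])
data = torch.randn(len(targets), 4)  # dummy features, illustration only
dataset = TensorDataset(data, targets)

# One inverse-frequency weight per sample.
class_counts = torch.bincount(targets)
sample_weights = 1.0 / class_counts[targets].float()

# num_samples controls how many indices the sampler yields per epoch.
num_samples = len(dataset)  # 12000
sampler = WeightedRandomSampler(sample_weights, num_samples=num_samples,
                                replacement=True)
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

num_batches = len(loader)
print(num_batches)  # 12000 / 100 = 120
```

With num_samples=12000 the loader still yields 120 batches; only the class composition of those batches changes.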
Thank you ptrblck.
You are right. The total number of samples drawn is equal to the size of the dataset, but the number of samples from the majority class decreases and the numbers from the minority classes increase.
What if I would like to draw all the images of the majority class? Do I have to count the samples of each class and manually increase the number of minority class samples to match the majority class? Here is the code snippet:
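import torch
from torch.utils.data import WeightedRandomSampler

# image_datasets['train'] is the asker's own dataset and is assumed to exist
target = torch.tensor(image_datasets['train'].targets)
# number of samples per class, in sorted class order
class_sample_count = torch.tensor([(target == t).sum()
                                   for t in torch.unique(target, sorted=True)])
# inverse-frequency weight per class, then one weight per sample
weight = 1. / class_sample_count.float()
samples_weight = torch.tensor([weight[t] for t in target])
train_sampler = WeightedRandomSampler(samples_weight, len(samples_weight), replacement=True)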
WeightedRandomSampler creates the indices using a multinomial distribution (line of code), so you cannot specify exactly how many samples will be drawn from a certain distribution, if
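As a rough illustration of why exact per-class counts cannot be requested, the sketch below draws 3 times the majority class size (a hypothetical choice of num_samples, not something suggested in the thread) from a toy target tensor and counts what comes out. The per-class counts fluctuate around 10000, and even then not every majority index is hit:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Toy targets with the class sizes from the question.
class_sizes = [10000, 1000, 1000]
targets = torch.cat([torch.full((n,), c, dtype=torch.long)
                     for c, n in enumerate(class_sizes)])
sample_weights = 1.0 / torch.bincount(targets)[targets].float()

# Hypothetical choice: draw 3 * majority size, so each class is expected
# (but not guaranteed) to contribute about 10000 draws.
num_samples = 3 * max(class_sizes)
sampler = WeightedRandomSampler(sample_weights, num_samples=num_samples,
                                replacement=True)

indices = torch.tensor(list(sampler))
drawn_per_class = torch.bincount(targets[indices], minlength=3)
unique_majority = indices[targets[indices] == 0].unique().numel()

print(drawn_per_class)   # roughly 10000 each, varies from run to run
print(unique_majority)   # distinct majority samples drawn, fewer than 10000
```

If every majority sample has to be seen exactly once per epoch, sampling with replacement will not provide that; building an explicit index list yourself would be one alternative.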
Thank you for your reply.
Actually I am a bit curious about how people train on imbalanced datasets in PyTorch. Do they create augmented images first, or do people really use WeightedRandomSampler?
Data augmentation does not replace balancing.
When you apply augmentation techniques to your data samples, you would apply e.g. random transformations to your images. While this randomizes the data domain, it does not change the target in a classification use case.
If you are applying the WeightedRandomSampler, each batch will be created using the weights passed to the sampler, so that in a balancing use case each batch should contain approx. the same number of samples for each class.
Since both approaches do not interfere with each other, you could apply them together.
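A minimal sketch of using both together, assuming a hypothetical toy dataset class (AugmentedToyDataset, with random tensors standing in for images and a random horizontal flip standing in for a real augmentation pipeline). The augmentation lives in __getitem__, the balancing lives in the sampler:

```python
import torch
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler


class AugmentedToyDataset(Dataset):
    """Hypothetical dataset: random 'images' with a random horizontal flip."""

    def __init__(self, images, targets):
        self.images = images
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Augmentation changes the input only, never the target.
        if torch.rand(1).item() < 0.5:
            img = torch.flip(img, dims=[-1])
        return img, self.targets[idx]


torch.manual_seed(0)
class_sizes = [1000, 100, 100]
targets = torch.cat([torch.full((n,), c, dtype=torch.long)
                     for c, n in enumerate(class_sizes)])
images = torch.randn(len(targets), 1, 8, 8)
dataset = AugmentedToyDataset(images, targets)

# Balancing is handled separately by the sampler.
sample_weights = 1.0 / torch.bincount(targets)[targets].float()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=120, sampler=sampler)

batch_images, batch_targets = next(iter(loader))
batch_counts = torch.bincount(batch_targets, minlength=3)
print(batch_counts)  # approximately balanced, not exactly 40 / 40 / 40
```

Because minority samples are redrawn often, the random transformation also means the repeated draws are not identical copies.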
Thank you ptrblck, I will try to use both.
Does that mean that, with replacement=True in the WeightedRandomSampler, it’s possible that some images from the majority class never get sampled?
Also, in terms of dealing with imbalanced data, do you have any recommendations on how to choose between using WeightedRandomSampler to oversample minority classes, and using the weight argument in Loss functions?
Yes, that might happen, as seen here:
import torch
# 100 equally likely indices, 100 draws with replacement
weights = torch.tensor([0.01] * 100)
l = torch.multinomial(weights, 100, True)
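To see how many indices are actually missed, one could count the unique values in the draw. This is a small sketch along the same lines (the variable names are mine); for n equally likely items and n draws with replacement, the expected fraction of distinct items is 1 - (1 - 1/n)^n, which is about 0.634 for n = 100:

```python
import torch

torch.manual_seed(0)

# 100 equally likely indices, 100 draws with replacement.
weights = torch.tensor([0.01] * 100)
drawn = torch.multinomial(weights, 100, replacement=True)

# Number of distinct indices that were drawn at least once.
num_unique = drawn.unique().numel()

# Expected fraction of distinct indices for n draws from n items.
expected_fraction = 1 - (1 - 1 / 100) ** 100

print(num_unique)                    # well below 100 in practice
print(round(expected_fraction, 3))   # 0.634
```

So in a single epoch roughly a third of the indices are expected to be skipped, although different ones are skipped in each epoch.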
I’ve had better results using the WeightedRandomSampler than using a weight parameter in the loss function, but ymmv.
Thanks for the reply!
I’ve seen people proposing to weight the loss function (attributing more weight to the minority class) based on the relative class frequencies in each mini-batch.
Can this also be implemented in PyTorch?
Yes, you could pass the weight argument to your loss function, e.g.
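A minimal sketch of passing fixed class weights to the criterion. It assumes nn.CrossEntropyLoss and inverse-frequency weights computed from the class sizes mentioned at the top of the thread; both the weighting heuristic and the random logits are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Class sizes from the start of the thread.
class_counts = torch.tensor([10000.0, 1000.0, 1000.0])

# Inverse-frequency weights (one common heuristic, assumed here):
# total / (num_classes * count) gives [0.4, 4.0, 4.0].
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# The same weights are applied to every batch.
criterion = nn.CrossEntropyLoss(weight=class_weights)

torch.manual_seed(0)
logits = torch.randn(8, 3, requires_grad=True)  # dummy model output
target = torch.randint(0, 3, (8,))

loss = criterion(logits, target)
loss.backward()
print(class_weights)
```

Misclassified minority samples then contribute 10 times more to the loss than majority samples.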
But in their proposed methods, weights are different for each mini-batch (since the class counts in each mini-batch are different). For example, in batch 1, the class counts are (10,20,2), and in batch 2 the class counts are (10,15,7). I thought the loss function can only take in one set of weights and that’s it.
If you want to change the class weights based on the current class distribution, you could use a non-reduced loss via reduction='none' and multiply it with your created weight tensor.
Otherwise, the passed weights will be applied to the corresponding class samples.
Could you please give me a code sample of how to do that?
Sure, this code shows how to use a per-batch weighting.
However, note that you will get the same results as directly passing the weight to the criterion if you don’t change the weights based on the current batch:
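import torch
import torch.nn as nn

class_weights = torch.tensor([0.1, 0.5, 0.4])
output = torch.randn(10, 3, requires_grad=True)
target = torch.randint(0, 3, (10,))
# unreduced loss: one value per sample
criterion = nn.CrossEntropyLoss(reduction='none')
loss = criterion(output, target)
# weight each sample by its class weight, then normalize by the sum of the applied weights
loss = loss * class_weights[target]
loss = (loss / class_weights[target].sum()).sum()
# reference: the same weights passed directly to the criterion
weighted_criterion = nn.CrossEntropyLoss(weight=class_weights)
weighted_loss = weighted_criterion(output, target)
print(torch.allclose(loss, weighted_loss))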
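For the case where the weights really do depend on the current batch, a sketch could recompute them from the batch targets with torch.bincount. The inverse-frequency rule and the clamp for absent classes are my assumptions, not part of the proposal discussed above; the batch uses the class counts (10, 20, 2) from the earlier example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 3

# Dummy batch with class counts (10, 20, 2).
target = torch.tensor([0] * 10 + [1] * 20 + [2] * 2)
output = torch.randn(len(target), num_classes, requires_grad=True)

# Weights derived from the current batch: inverse class frequency.
# clamp(min=1) avoids division by zero for classes absent from the batch.
batch_counts = torch.bincount(target, minlength=num_classes).float()
batch_weights = 1.0 / batch_counts.clamp(min=1)

# Unreduced loss, weighted per sample, normalized by the sum of weights.
criterion = nn.CrossEntropyLoss(reduction='none')
per_sample_loss = criterion(output, target)
sample_weights = batch_weights[target]
loss = (per_sample_loss * sample_weights).sum() / sample_weights.sum()

# Cross-check against the functional API with the same per-batch weights.
reference = nn.functional.cross_entropy(output, target, weight=batch_weights)
print(torch.allclose(loss, reference))
```

In a training loop these lines would run once per batch, so batch_weights changes whenever the class counts in the batch change.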