Does WeightedRandomSampler increase iterations?

joshhu · July 24, 2019, 2:12pm

I have an imbalanced dataset with 3 classes. The majority class is 10 times the size of each of the 2 minority classes. Say the class sizes are [10000, 1000, 1000]. If I use the WeightedRandomSampler, samples from the minority classes will be drawn with a higher probability than samples from the majority class, which means a batch from a loader will have roughly the same number of samples from each class. Does that mean that my dataloader will iterate more times than it would on the imbalanced dataset? Suppose I have a batch size of 100: the imbalanced dataset will have 120 iterations ((10000+1000+1000) / 100), and with the WeightedRandomSampler, will there be 300 iterations ((10000+10000+10000) / 100)? Is that correct?

Thank you.

ptrblck · July 24, 2019, 10:53pm

Not necessarily, as you have to provide the number of samples to draw using the num_samples argument. E.g. if you specify to draw 12000 samples with replacement=True, some samples of your majority class will probably not be drawn.
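A minimal sketch of this point, assuming a synthetic stand-in dataset with the class sizes from the question (the tensor names and placeholder features are illustrative, not from the original post). It shows that the number of iterations depends on num_samples and the batch size, not on the class weights:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Synthetic stand-in for the dataset in the question: class sizes [10000, 1000, 1000].
targets = torch.cat([
    torch.zeros(10000, dtype=torch.long),
    torch.ones(1000, dtype=torch.long),
    torch.full((1000,), 2, dtype=torch.long),
])
features = torch.zeros(len(targets), 1)  # placeholder inputs (assumption)
dataset = TensorDataset(features, targets)

# Inverse-frequency weight per class, then one weight per sample.
class_counts = torch.bincount(targets)
sample_weights = (1.0 / class_counts.float())[targets]

batch_size = 100
lengths = {}
for num_samples in (12000, 30000):
    sampler = WeightedRandomSampler(sample_weights, num_samples, replacement=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    # The loader length follows num_samples, not the class balance.
    lengths[num_samples] = len(loader)

print(lengths)  # {12000: 120, 30000: 300}
```

So 300 iterations only happen if num_samples is explicitly set to 30000; with num_samples equal to the dataset size, the loader still yields 120 batches.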

joshhu · July 25, 2019, 12:15am

Thank you ptrblck.

You are right. The total number of samples drawn is equal to the size of the dataset, but the number of majority-class samples decreases and the number of minority-class samples increases.

What if I would like to draw all the images of the majority class? Do I have to count the samples in each class and manually increase the number of minority-class samples to match the majority class? Here is the code snippet:

import torch
from torch.utils.data import WeightedRandomSampler

target = torch.tensor(image_datasets['train'].targets)
# Number of samples per class, ordered by sorted class id
_, class_sample_count = torch.unique(target, sorted=True, return_counts=True)
# Inverse-frequency weight per class
weight = 1. / class_sample_count.float()
# One weight per sample, looked up by the sample's class
samples_weight = weight[target]
train_sampler = WeightedRandomSampler(samples_weight, len(samples_weight), replacement=True)

ptrblck · July 25, 2019, 7:35am

Since the WeightedRandomSampler creates the indices using a multinomial distribution (via torch.multinomial), you cannot specify exactly how many samples will be drawn from a certain class if replacement=True.
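As a rough illustration, the sketch below draws one epoch of indices from a sampler built on synthetic targets with the class sizes from the question (an assumption for demonstration) and counts how many draws land in each class. The counts come out near 4000 per class but are not exactly equal, and only part of the majority class is seen:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)  # only to make the sketch repeatable

# Synthetic targets with class sizes [10000, 1000, 1000] (assumption).
targets = torch.cat([
    torch.zeros(10000, dtype=torch.long),
    torch.ones(1000, dtype=torch.long),
    torch.full((1000,), 2, dtype=torch.long),
])
class_counts = torch.bincount(targets)
sample_weights = (1.0 / class_counts.float())[targets]

sampler = WeightedRandomSampler(sample_weights, len(targets), replacement=True)
indices = torch.tensor(list(sampler))

# Per-class draw counts: each class is expected to get about a third of the draws.
drawn_per_class = torch.bincount(targets[indices], minlength=3)
print(drawn_per_class)

# Number of distinct majority-class samples that appeared at least once.
majority_seen = torch.unique(indices[targets[indices] == 0]).numel()
print(majority_seen / class_counts[0].item())
```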

joshhu · July 25, 2019, 2:54pm

Thank you for your reply.

Actually, I am a bit curious about how people train on imbalanced datasets in PyTorch. Do they create augmented images first, or do people really use WeightedRandomSampler?

ptrblck · July 25, 2019, 9:46pm

Data augmentation does not replace balancing.
When you apply augmentation techniques to your data samples, you would apply e.g. random transformations to your images. While this randomizes the data domain, it does not change the target in a classification use case.
If you are applying the WeightedRandomSampler, each batch will be created using the weights passed to the sampler, so that in a balancing use case, each batch should contain approx. the same number of samples for each class.
Since the two approaches do not interfere with each other, you could apply them together.
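A minimal sketch of combining the two, assuming a hypothetical toy dataset class (FlipAugmentedDataset is made up for illustration) that applies a random horizontal flip as its only augmentation. The augmentation lives in the dataset and the balancing lives in the sampler:

```python
import torch
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler

class FlipAugmentedDataset(Dataset):
    """Hypothetical toy dataset: random horizontal flip as the only augmentation."""

    def __init__(self, images, targets, flip_prob=0.5):
        self.images, self.targets, self.flip_prob = images, targets, flip_prob

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        image = self.images[idx]
        # Augmentation changes the input only; the target is untouched.
        if torch.rand(1).item() < self.flip_prob:
            image = torch.flip(image, dims=[-1])
        return image, self.targets[idx]

# Imbalanced toy data: 900 samples of class 0, 100 of class 1.
targets = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
images = torch.randn(len(targets), 3, 8, 8)
dataset = FlipAugmentedDataset(images, targets)

# The sampler handles class balance independently of the augmentation.
sample_weights = (1.0 / torch.bincount(targets).float())[targets]
sampler = WeightedRandomSampler(sample_weights, len(targets), replacement=True)
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

batch_images, batch_targets = next(iter(loader))
print(batch_images.shape, torch.bincount(batch_targets, minlength=2))
```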

joshhu · July 26, 2019, 7:54am

Thank you ptrblck, I will try to use both.

rzhang63 · February 20, 2020, 1:48am

Hi ptrblck,

Does this mean that with replacement=True in the WeightedRandomSampler, it’s possible that some images from the majority class never get sampled?
Also, in terms of dealing with imbalanced data, do you have any recommendations on how to choose between using WeightedRandomSampler to oversample minority classes and using the weight argument in loss functions?

Thank you!

ptrblck · February 20, 2020, 2:10am

Yes, that might happen, as seen here:

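import torch

# 100 equally likely indices, 100 draws with replacement
weights = torch.tensor([0.01] * 100)
l = torch.multinomial(weights, 100, True)
# Number of distinct indices drawn (varies between runs)
print(l.unique().shape)
> torch.Size([59])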

I’ve had better results using the WeightedRandomSampler than using a weight parameter in the loss function, but YMMV.
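For the torch.multinomial snippet above, the number of distinct indices can also be estimated analytically: with n equally weighted items and n draws with replacement, a given item is missed with probability (1 - 1/n)^n, so roughly 63 of 100 indices are expected to appear. The sketch below compares that estimate with an empirical average (the trial count is an arbitrary choice):

```python
import torch

n = 100
weights = torch.full((n,), 1.0 / n)

# Probability that a given index is never drawn in n uniform draws: (1 - 1/n)^n.
expected_unique = n * (1 - (1 - 1 / n) ** n)
print(round(expected_unique, 1))  # 63.4

# Empirical average over repeated draws for comparison.
trials = 2000
counts = [
    torch.multinomial(weights, n, replacement=True).unique().numel()
    for _ in range(trials)
]
empirical_mean = sum(counts) / trials
print(empirical_mean)
```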

rzhang63 · February 20, 2020, 2:19am

Thanks for the reply!
I’ve seen people proposing weighting the loss function (attributing more weight to the minority class) according to the relative class frequencies in each mini-batch.
Can this also be implemented in pytorch?

ptrblck · February 20, 2020, 2:21am

Yes, you could pass the weight argument to your loss function, e.g. nn.CrossEntropyLoss.

rzhang63 · February 20, 2020, 2:26am

But in their proposed methods, weights are different for each mini-batch (since the class counts in each mini-batch are different). For example, in batch 1, the class counts are (10,20,2), and in batch 2 the class counts are (10,15,7). I thought the loss function can only take in one set of weights and that’s it.

ptrblck · February 20, 2020, 2:29am

If you want to change the class weights based on the current class distribution, you could use a non-reduced loss via reduction='none' and multiply with your created weight tensor.
Otherwise, the passed weights will be applied to the corresponding class samples.

rzhang63 · March 7, 2020, 3:47am

Could you please give me a code sample of how to do that?

Thanks!

ptrblck · March 7, 2020, 6:43am

Sure, this code shows how to use a per-batch weighting.
However, note that you will get the same results as directly passing the weight to the criterion, if you don’t change the weights based on the current batch:

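import torch
import torch.nn as nn

class_weights = torch.tensor([0.1, 0.5, 0.4])
output = torch.randn(10, 3, requires_grad=True)
target = torch.randint(0, 3, (10,))

# Unreduced loss: one value per sample
criterion = nn.CrossEntropyLoss(reduction='none')
loss = criterion(output, target)
# Scale each sample's loss by the weight of its target class
loss = loss * class_weights[target]
# Normalize by the sum of the applied weights (weighted mean)
loss = (loss / class_weights[target].sum()).sum()

# Reference: the built-in weight argument gives the same result
weighted_criterion = nn.CrossEntropyLoss(weight=class_weights)
weighted_loss = weighted_criterion(output, target)
print(torch.allclose(weighted_loss, loss))
> True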
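To actually vary the weights per mini-batch, the weight tensor can be recomputed from the targets of the current batch. The sketch below assumes inverse class frequency within the batch as the weighting scheme, which is one possible choice and not something prescribed in this thread; the variable names are illustrative:

```python
import torch
import torch.nn as nn

num_classes = 3
output = torch.randn(32, num_classes, requires_grad=True)
target = torch.randint(0, num_classes, (32,))

# Class counts in this batch; minlength keeps absent classes in the tensor.
batch_counts = torch.bincount(target, minlength=num_classes).float()

# Inverse-frequency weights (assumed scheme); clamp avoids division by zero for
# absent classes, whose weight is never used since no sample has that class.
batch_weights = 1.0 / batch_counts.clamp(min=1.0)

criterion = nn.CrossEntropyLoss(reduction='none')
per_sample_loss = criterion(output, target)
sample_weights = batch_weights[target]

# Weighted mean, matching the normalization used by the weight argument.
loss = (per_sample_loss * sample_weights).sum() / sample_weights.sum()
loss.backward()
```

In a training loop, batch_counts and batch_weights would be recomputed for every batch before the loss is reduced.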