# How does the WeightedRandomSampler work?

I have a binary classification problem and an unbalanced dataset. I decided to use `WeightedRandomSampler` from `torch.utils.data`:

```python
class_weights = [0.35, 0.65]
class_weights_initialize = []
for i in train_readyframe['target']:
    class_weights_initialize.append(class_weights[i])
weighted_sampler = WeightedRandomSampler(weights=class_weights_initialize,
                                         num_samples=len(class_weights_initialize),
                                         replacement=True)
```

I have given a weight of 0.35 to the 0th class and 0.65 to the 1st class.
Does this mean the DataLoader will select 65% of the 1st class and 35% of the 0th class in a single batch of training data?

Hello Ishal!

No, not exactly. To take an extreme example, let’s say that your
dataset consists of 100 samples, all from your 1st class. Then
each batch you generate will only contain samples from your
1st class.

Now let’s say your dataset consists of 65 samples from your 0th
class and 35 samples from your 1st class. Now each batch you
generate will be, on average, equally weighted between the two
classes, but any given batch will have random fluctuations away
from being exactly equally weighted.
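A quick simulation makes this concrete (the 65/35 sample counts here are hypothetical numbers chosen so that, with the 0.35/0.65 weights from the question, the two classes become equally likely: 65 × 0.35 == 35 × 0.65):

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# 65 samples of class 0, 35 samples of class 1 (hypothetical counts)
targets = torch.tensor([0] * 65 + [1] * 35)
class_weights = torch.tensor([0.35, 0.65])
sample_weights = class_weights[targets]   # one weight per *sample*

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=10_000,
                                replacement=True)

# Draw many indices and look at the class balance of what was drawn
drawn = targets[torch.tensor(list(sampler))]
frac_class1 = drawn.float().mean().item()
print(f"fraction of class-1 samples: {frac_class1:.3f}")  # ~0.5
```

Any single batch of, say, 32 indices would fluctuate around that 50/50 split; only the long-run average is balanced.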

Best.

K. Frank

But what do you mean by "each batch you
generate will be on average equally weighted between the two
classes"?
Sorry for the noob question. I am new to data science.

Hi Ishal!

Using `WeightedRandomSampler` with a dataloader will build
batches by randomly sampling from your training set.

Let’s say you have six samples in your dataset, with items 1, 2,
and 3 being from class 0, and items 4, 5, and 6 being from class
1.

Let’s build a batch of four items by rolling a six-sided die four
times.

For one batch you roll 3, 6, 4, 6. This batch will have one class-0
sample, and three class-1 samples – not fifty-fifty. Then you roll
5, 2, 4, 3. This batch will be fifty-fifty.

It’s random whether any specific batch is fifty-fifty, but on average
your batches will consist of 50% class-0 samples and 50% class-1
samples.
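The die-rolling example can be run with an actual sampler. This sketch uses six dummy samples with equal weights (1.0 each, an arbitrary choice; only the relative weights matter) and prints a few batches so you can watch individual batches fluctuate around 50/50:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Items 0-2 are class 0, items 3-5 are class 1 (the "six-sided die")
targets = torch.tensor([0, 0, 0, 1, 1, 1])
dataset = TensorDataset(targets)

# Equal weights: every item is equally likely on each draw
sampler = WeightedRandomSampler(weights=[1.0] * 6,
                                num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler, drop_last=True)

for epoch in range(3):
    for (batch,) in loader:
        # Some batches are 2/2, others 1/3 or 3/1 - random fluctuation
        print("batch:", batch.tolist(),
              "class-1 count:", batch.sum().item())
```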

Best.

K. Frank


Nice explanation! Will it always generate 50-50 when the class weights are [0.5, 0.5], or does it depend on the dataset distribution?

Hi Ishal!

As the example in my first reply shows, it clearly must depend
on the dataset distribution.

In short, the probability of drawing a sample of a given class is
proportional to the fraction of that class in your dataset times
the weight you assign to that class.

Best.

K. Frank


Thanks for the explanation, KFrank.
So if I have a dataset in which one class makes up only 2-3% of the samples, then I should give that class a higher weight than the other.
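One common recipe for choosing those weights (a sketch of standard practice, not something stated in this thread) is to weight each class by the inverse of its frequency, so a heavily skewed dataset is drawn from roughly 50/50. Here a hypothetical 97/3 split illustrates it:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical labels: 97 samples of class 0, 3 samples of class 1
targets = torch.tensor([0] * 97 + [1] * 3)

counts = torch.bincount(targets).float()   # tensor([97., 3.])
class_weights = 1.0 / counts               # inverse class frequency
sample_weights = class_weights[targets]    # one weight per sample

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=10_000,
                                replacement=True)
drawn = targets[torch.tensor(list(sampler))]
frac_class1 = drawn.float().mean().item()
print(f"class-1 fraction of drawn samples: {frac_class1:.3f}")  # ~0.5
```

Note that `replacement=True` matters here: with only 3 minority samples, they must be drawn many times over to fill balanced batches.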