# How does the WeightedRandomSampler work?

I have a binary classification problem and an unbalanced dataset. I decided to use `WeightedRandomSampler` from `torch.utils.data`:

```python
class_weights = [0.35, 0.65]
class_weights_initialize = []
for i in train_readyframe['target']:
    class_weights_initialize.append(class_weights[i])
weighted_sampler = WeightedRandomSampler(weights=class_weights_initialize,
                                         num_samples=len(class_weights_initialize),
                                         replacement=True)
```

I have given a weight of 0.35 to the 0th class and 0.65 to the 1st class.
Does this mean the DataLoader will select 65% of the 1st class and 35% of the 0th class in a single batch of training data?

Hello Ishal!

No, not exactly. To take an extreme example, let’s say that your
dataset consists of 100 samples, all from your 1st class. Then
each batch you generate will only contain samples from your
1st class.

Now let’s say your dataset consists of 65 samples from your 0th
class and 35 samples from your 1st class. Now each batch you
generate will be, on average, equally weighted between the two
classes, but any given batch will have random fluctuations away
from being exactly equally weighted.
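A quick simulation makes this concrete (the 65/35 sample counts here are hypothetical numbers chosen so that, with the 0.35/0.65 weights from the question, the two classes become equally likely: 65 × 0.35 == 35 × 0.65):

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# 65 samples of class 0, 35 samples of class 1 (hypothetical counts)
targets = torch.tensor([0] * 65 + [1] * 35)
class_weights = torch.tensor([0.35, 0.65])
sample_weights = class_weights[targets]   # one weight per *sample*

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=10_000,
                                replacement=True)

# Draw many indices and look at the class balance of what was drawn
drawn = targets[torch.tensor(list(sampler))]
frac_class1 = drawn.float().mean().item()
print(f"fraction of class-1 samples: {frac_class1:.3f}")  # ~0.5
```

Any single batch of, say, 32 indices would fluctuate around that 50/50 split; only the long-run average is balanced.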

Best.

K. Frank

But what do you mean by "each batch you
generate will be on average equally weighted between the two
classes"?
Sorry for the noob question. I am new to data science.

Hi Ishal!

Using `WeightedRandomSampler` with a dataloader will build
batches by randomly sampling from your training set.

Let’s say you have six samples in your dataset, with items 1, 2,
and 3 being from class 0, and items 4, 5, and 6 being from class
1.

Let’s build a batch of four items by rolling a six-sided die four
times.

For one batch you roll 3, 6, 4, 6. This batch will have one class-0
sample, and three class-1 samples – not fifty-fifty. Then you roll
5, 2, 4, 3. This batch will be fifty-fifty.

It’s random whether any specific batch is fifty-fifty, but on average
your batches will consist of 50% class-0 samples and 50% class-1
samples.
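The die-rolling example can be run with an actual sampler. This sketch uses six dummy samples with equal weights (1.0 each, an arbitrary choice; only the relative weights matter) and prints a few batches so you can watch individual batches fluctuate around 50/50:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Items 0-2 are class 0, items 3-5 are class 1 (the "six-sided die")
targets = torch.tensor([0, 0, 0, 1, 1, 1])
dataset = TensorDataset(targets)

# Equal weights: every item is equally likely on each draw
sampler = WeightedRandomSampler(weights=[1.0] * 6,
                                num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler, drop_last=True)

for epoch in range(3):
    for (batch,) in loader:
        # Some batches are 2/2, others 1/3 or 3/1 - random fluctuation
        print("batch:", batch.tolist(),
              "class-1 count:", batch.sum().item())
```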

Best.

K. Frank


Nice explanation! Will it always generate 50-50 when the class weights are [0.5, 0.5], or does it depend on the dataset distribution?

Hi Ishal!

As the example in my first reply shows, it clearly must depend
on the dataset distribution.

In short, the probability of drawing a sample of a given class is
proportional to the fraction of that class in your dataset times
the weight you assign to that class.

Best.

K. Frank


Thanks for the explanation, KFrank.
So if I have a dataset in which one class makes up only 2-3% of the samples, then I should give that class a higher weight than the other.
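One common recipe for choosing those weights (a sketch of standard practice, not something stated in this thread) is to weight each class by the inverse of its frequency, so a heavily skewed dataset is drawn from roughly 50/50. Here a hypothetical 97/3 split illustrates it:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical labels: 97 samples of class 0, 3 samples of class 1
targets = torch.tensor([0] * 97 + [1] * 3)

counts = torch.bincount(targets).float()   # tensor([97., 3.])
class_weights = 1.0 / counts               # inverse class frequency
sample_weights = class_weights[targets]    # one weight per sample

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=10_000,
                                replacement=True)
drawn = targets[torch.tensor(list(sampler))]
frac_class1 = drawn.float().mean().item()
print(f"class-1 fraction of drawn samples: {frac_class1:.3f}")  # ~0.5
```

Note that `replacement=True` matters here: with only 3 minority samples, they must be drawn many times over to fill balanced batches.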