@ptrblck, I am trying to use `WeightedRandomSampler` to handle class imbalance in my dataset, but the intuition behind it is not clear to me. My target labels are one-hot encoded vectors, as shown below.
```python
train_labels.head(5)
```

```
   none  infection  ischaemia  both
0     1          0          0     0
1     1          0          0     0
2     0          1          0     0
3     0          1          0     0
4     0          1          0     0
```
Below are the steps I used to compute the weights for the `WeightedRandomSampler`. Please correct me if my interpretation of any step is wrong.
- Count the number of samples per class in the dataset:

```python
class_sample_count = np.array(train_labels.value_counts())
class_sample_count
```

```
array([2555, 2552,  621,  227])
```
- Calculate the weight associated with each class:

```python
weight = 1. / class_sample_count
weight
```

```
array([0.00039139, 0.00039185, 0.00161031, 0.00440529])
```
- Calculate the weight for each of the samples in the dataset:

```python
samples_weight = np.array(weight[train_labels])
print(samples_weight[1], samples_weight[2])
```

```
[0.00039185 0.00039139 0.00039139 0.00039139]  # label 0 in the actual data
[0.00039139 0.00039185 0.00039139 0.00039139]  # label 1 in the actual data
```

The shape of `samples_weight` comes out to be `[5955, 4]`, where 5955 is the total number of images in the training set and 4 is the number of classes.
Now, how has this mapping been done? The class weight for class 0 is 0.00039139 (obtained in step 2), so how were the remaining three entries for a class-0 sample picked?
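To make the question concrete, here is a minimal reproduction of the indexing step with toy data (the weight values are made up). If my understanding of NumPy fancy indexing is right, `weight[train_labels]` uses the 0/1 *values* of the one-hot matrix as indices into `weight`, so every entry ends up being `weight[0]` or `weight[1]`, and the intended per-sample weight would instead come from the class index:

```python
import numpy as np

# hypothetical class weights, one per class
weight = np.array([0.1, 0.2, 0.3, 0.4])

# toy one-hot labels: 3 samples, 4 classes (classes 0, 1, 3)
one_hot = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1]])

# fancy indexing treats each 0/1 entry as an index into `weight`,
# so the result keeps the (3, 4) shape and only contains weight[0] / weight[1]
samples_weight = weight[one_hot]
print(samples_weight.shape)  # (3, 4)
print(samples_weight[0])     # [0.2 0.1 0.1 0.1]

# picking the weight via the class index instead gives one weight per sample
class_idx = one_hot.argmax(axis=1)  # [0, 1, 3]
per_sample = weight[class_idx]      # [0.1 0.2 0.4], shape (3,)
print(per_sample)
```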
- Convert the `np.array` to a tensor:

```python
samples_weight = torch.from_numpy(samples_weight)
samples_weight
```

```
tensor([[0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004],
        ...,
        [0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004]], dtype=torch.float64)
```
After the conversion to a tensor, all the samples appear to have the same value in all four entries. How, then, does `WeightedRandomSampler` oversample the minority class?
I would be grateful for any leads. Thank you.
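For reference, my reading of the docs is that `WeightedRandomSampler` expects a 1-D sequence of per-sample weights (one weight per sample, not an `[N, 4]` matrix). A minimal sketch with made-up weights, where the last two "minority" samples get 4x the weight of the others:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical per-sample weights: 8 majority-class samples, 2 minority-class
# samples whose weight is 4x larger, so indices 8 and 9 are drawn more often
weights = torch.tensor([0.1] * 8 + [0.4] * 2, dtype=torch.double)

sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                replacement=True)

# iterating the sampler yields dataset indices, e.g. to pass to a DataLoader
indices = list(sampler)
print(indices)  # 10 indices in [0, 9], minority indices over-represented
```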