I’m writing a multilabel classifier where each target is a multi-hot encoded vector over 13 classes, like the examples below:
0 "I'm a category 7 sample" ... [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
1 "I'm a category 8 sample" ... [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
2 "I'm a category 1 AND 6 sample" ... [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
The dataset is heavily imbalanced, with the following per-class distribution over 20,000 samples:
{'Cat 1': 450,
'Cat 2': 364,
'Cat 3': 37,
'Cat 4': 334,
'Cat 5': 630,
'Cat 6': 1096,
'Cat 7': 918,
'Cat 8': 3324,
'Cat 9': 2053,
'Cat 10': 532,
'Cat 11': 1110,
'Cat 12': 101,
'Cat 13': 776}
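(For reference, these counts can be reproduced by summing the columns of the multi-hot target matrix; a rough sketch, where full_dataset['list'] is a name I’m making up for the multi-hot lists of all 20,000 samples:)
import torch

targets = torch.tensor(full_dataset['list'])         # [20000, 13] multi-hot matrix
counts = targets.sum(dim=0)                          # positives per class
n_unlabelled = int((targets.sum(dim=1) == 0).sum())  # all-zero rows, roughly 10,000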
There are roughly 10,000 unlabelled samples in this dataset (the all-zero rows in y below); they appear as the 14th entry, 9967, in dist.
To address the imbalance in training, I am trying to use WeightedRandomSampler, based on this answer (Some problems with WeightedRandomSampler) by “ptrblck”.
Here’s the relevant code:
# list(dist.values()) = [450, 364, 37, 334, 630, 1096, 918, 3324, 2053, 532, 1110, 101, 776, 9967]
# (the final entry, 9967, is the unlabelled bucket)
weights = [1./v for v in dist.values()]  # inverse frequency per class
weights = torch.tensor(weights, dtype=torch.float)
# weights: tensor([0.0022, 0.0027, 0.0270, 0.0030, 0.0016, 0.0009, 0.0011, 0.0003, 0.0005,
#                  0.0019, 0.0009, 0.0099, 0.0013, 0.0001])
y = torch.tensor(train_dataset['list'], dtype=torch.long)  # multi-hot targets, [17000, 13]
# y:
# [ [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# ...
# [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],]
samples_weights = weights[y]  # indexes weights with the 0/1 entries of y
# for 17,000 training samples
# samples_weights torch.Size([17000, 13]):
[
[0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0027, 0.0022, 0.0022, 0.0022,
0.0022, 0.0022, 0.0022, 0.0022],
[0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022,
0.0022, 0.0022, 0.0022, 0.0022],
...
[0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022,
0.0022, 0.0027, 0.0022, 0.0022],
[0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0027, 0.0022, 0.0022,
0.0022, 0.0022, 0.0022, 0.0022],
[0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022, 0.0022,
0.0022, 0.0022, 0.0022, 0.0022],]
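For completeness, the sampler would eventually be constructed in the standard way from the linked answer (batch_size is just a placeholder; note WeightedRandomSampler expects a 1-D tensor of per-sample weights, not the [17000, 13] tensor above):
from torch.utils.data import DataLoader, WeightedRandomSampler

# one scalar weight per sample; this is the part my code above doesn't produce
sampler = WeightedRandomSampler(samples_weights, num_samples=len(samples_weights), replacement=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)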
Since my targets are multi-hot encoded, every element of samples_weights ends up as either weights[0] (where the target is 0) or weights[1] (where the target is 1), which is obviously wrong.
Is there any way I can make this work with the multi-hot targets?
Shouldn’t samples_weights be the element-wise multiplication of the weights vector and the targets for that sample?
For example, samples_weights for sample [0] above (a Cat 6 sample, weight 1/1096 ≈ 0.0009) would be:
[0, 0, 0, 0, 0, 0.0009, 0, 0, 0, 0, 0, 0, 0]
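In code, I imagine something like this, though I’m not sure how to then collapse each row into the single scalar the sampler expects per sample; sum is just my first guess:
# rough sketch of the element-wise idea
class_weights = weights[:13]              # drop the unlabelled bucket's weight
per_class = class_weights * y.float()     # element-wise product, shape [17000, 13]
samples_weights = per_class.sum(dim=1)    # reduce to one scalar per sample
# unlabelled rows are all zeros, so assign them the unlabelled weight instead
samples_weights[samples_weights == 0] = weights[13]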
This is all new to me so I appreciate any insight.