Multilabel classification under unbalanced class distributions

(Dim Trigkakis) #1

Hello everyone,

I am currently trying to retrain a classifier for Pascal VOC 2012 based on vgg11. The pretrained network loads fine, and I have added a fully connected layer from the 4096-dim feature vector to my 20 classes. I first tried training on a randomly generated dataset (where an image of random noise gets mapped to [1 0 … 0 1] and a white image gets mapped to [0 1 … 1 0]).

This part works fantastically well with the multi-label losses in PyTorch. However, in Pascal VOC the person class is overrepresented. As a result the network outputs [1 0 … 0 0], essentially classifying every image as containing a person and no other class. Accuracy in both of the above cases was calculated as suggested in “Calculating accuracy for multi-label classification” in the forums, and that worked well.
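To make the collapse concrete, here is a minimal sketch in plain Python. The labels are a made-up toy set over 4 classes (class 0 stands in for "person"), and `elementwise_accuracy` and `per_class_recall` are hypothetical helper names, not the code from the linked thread. It shows that a model predicting only the dominant class can still score a fairly high element-wise accuracy while missing every other class.

```python
def elementwise_accuracy(preds, targets):
    """Fraction of (image, class) entries that are predicted correctly."""
    total = sum(len(row) for row in targets)
    correct = sum(
        p == t
        for pred_row, target_row in zip(preds, targets)
        for p, t in zip(pred_row, target_row)
    )
    return correct / total


def per_class_recall(preds, targets):
    """Recall for each class; None where a class has no positive example."""
    num_classes = len(targets[0])
    recalls = []
    for c in range(num_classes):
        positives = sum(t[c] for t in targets)
        hits = sum(1 for p, t in zip(preds, targets) if t[c] == 1 and p[c] == 1)
        recalls.append(hits / positives if positives else None)
    return recalls


# Toy multi-hot labels (assumption: 4 classes, class 0 is the dominant one).
targets = [
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
# A collapsed model that predicts only class 0 for every image.
preds = [[1, 0, 0, 0] for _ in targets]

print(elementwise_accuracy(preds, targets))  # 0.75
print(per_class_recall(preds, targets))      # [1.0, 0.0, 0.0, 0.0]
```

Looking at per-class recall (or precision) next to the overall accuracy makes this failure visible right away.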

I have looked in the forums but found nothing about this problem. What would one need to do to simply train the network on the imbalanced Pascal VOC 2012 data?

Thank you


See if the comment in “How to Prevent Overfitting” might help. You can use some kind of weighted sampling to rebalance the class distributions when sampling from the dataset.
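A minimal sketch of what such weighting could look like for multi-label targets, assuming each image is weighted by the inverse frequency of its rarest label. That rule is just one possible heuristic, and `sample_weights` and the toy labels are made up for illustration.

```python
def sample_weights(targets):
    """One weight per image: inverse count of the rarest label it carries."""
    num_classes = len(targets[0])
    class_counts = [sum(t[c] for t in targets) for c in range(num_classes)]
    weights = []
    for t in targets:
        present = [class_counts[c] for c in range(num_classes) if t[c] == 1]
        if present:
            weights.append(1.0 / min(present))
        else:
            # Assumption: images without any label get a small default weight.
            weights.append(1.0 / len(targets))
    return weights


# Toy multi-hot labels (assumption: 4 classes, class 0 is the dominant one).
targets = [
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
weights = sample_weights(targets)
print(weights)
```

A list like this can be passed as `weights` to `torch.utils.data.WeightedRandomSampler` with `replacement=True`, and the sampler is then given to the `DataLoader`. Because an image carries several labels, oversampling a rare class also oversamples whatever co-occurs with it, so the resulting class balance is only approximate.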

(Dim Trigkakis) #3

I have tried using weighted sampling, but that doesn’t help. Suppose two classes that are usually mutually exclusive both have a low chance of being sampled; they will each get assigned a greater weight. But that happens for 19 out of the 20 classes, so they all become equiprobable, and the probability of any one of them being picked only increases from 1/20 to 1/19.

Weighting the loss hasn’t helped either, even when I only weighted the false positives.
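For reference, one common way to weight a multi-label loss is a per-class weight on the positive term of binary cross-entropy, which is the role `pos_weight` plays in `nn.BCEWithLogitsLoss`. The sketch below is plain Python with toy labels and hypothetical helper names. The negatives-to-positives ratio is an assumed heuristic, not necessarily the weighting used above.

```python
import math


def pos_weights(targets):
    """Per-class ratio of negative to positive examples (1.0 if no positives)."""
    n = len(targets)
    num_classes = len(targets[0])
    out = []
    for c in range(num_classes):
        pos = sum(t[c] for t in targets)
        out.append((n - pos) / pos if pos else 1.0)
    return out


def weighted_bce_with_logits(logit, target, pos_weight):
    """-[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))] for one entry.

    Not hardened against overflow for very large |logit|; sketch only.
    """
    log_sig = -math.log1p(math.exp(-logit))           # log(sigmoid(x))
    log_one_minus_sig = -math.log1p(math.exp(logit))  # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)


# Toy multi-hot labels (assumption: 4 classes, class 0 is the dominant one).
targets = [
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
pw = pos_weights(targets)
print(pw)

# A missed positive of a rare class costs pw[1] times more than a false
# positive at the same logit.
missed_positive = weighted_bce_with_logits(0.0, 1, pw[1])
false_positive = weighted_bce_with_logits(0.0, 0, pw[1])
print(missed_positive / false_positive)
```

In PyTorch the same per-class vector would be converted to a tensor of length 20 and passed as `pos_weight` when constructing `nn.BCEWithLogitsLoss`.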

My code uses vgg16, and I do not fine-tune the convolutional layers on Pascal VOC 2012 (only the new final layer is trained):

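import torch.nn as nn
import torch.optim as optim
from torchvision.models.vgg import VGG, make_layers

# VGG16 configuration: numbers are conv output channels, 'M' is max-pooling
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']
model = VGG(make_layers(cfg))

# Freeze the convolutional feature extractor
for param in model.features.parameters():
    param.requires_grad = False

# Drop the original 1000-way output layer and attach a 20-way layer for the Pascal VOC classes
new_classifier = nn.Sequential(*list(model.classifier.children())[:-1])
model.classifier = new_classifier
model.classifier.fc = nn.Linear(4096, 20)

# Only the new output layer is passed to the optimizer
optimizer = optim.SGD(model.classifier.fc.parameters(), lr=0.0001)

multi_label_loss = nn.MultiLabelSoftMarginLoss()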

(Zhicheng Huang) #4

Have you solved this problem? If so, could you share some details? Thank you.

(Ikram CHAABANE) #5

There is a new asymmetric entropy improving the classification of imbalanced data. Please find details in