Balancing a multilabel dataset

Hello,

I have a dataset that contains email texts and their corresponding labels. Each email can have multiple labels, making it a multilabel problem. I used multi-hot encoding for labels, so they look like [1, 0, 1, 0, 0], where 1 indicates that an email belongs to that class. However, my data is imbalanced, and some label combinations barely occur in the dataset, which makes my LSTM biased towards the majority classes. I tried specifying pos_weight for BCEWithLogitsLoss, but that doesn’t seem to help. What are some balancing techniques that I could use?
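In case it helps, here is roughly what I tried (dummy data; I computed pos_weight as the usual negative-to-positive ratio per class):

import torch
import torch.nn as nn

# train_targets: multi-hot label matrix, shape [num_samples, num_classes]
train_targets = torch.tensor([[1., 0., 1., 0., 0.],
                              [0., 1., 0., 0., 0.],
                              [1., 0., 0., 1., 0.]])

pos_counts = train_targets.sum(dim=0)                # positives per class
neg_counts = len(train_targets) - pos_counts         # negatives per class
pos_weight = neg_counts / pos_counts.clamp(min=1)    # neg/pos ratio per class

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)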

Balancing the data in a multi-label use case is not trivial, and this post describes some methods for balanced sampling.
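As a rough sketch of one heuristic from that family (using torch.utils.data.WeightedRandomSampler; weighting each sample by its rarest positive label is just one possible choice):

import torch
from torch.utils.data import WeightedRandomSampler

# all_targets: multi-hot labels, shape [num_samples, num_classes]
all_targets = torch.tensor([[1., 0., 1., 0., 0.],
                            [0., 1., 0., 0., 0.],
                            [1., 0., 0., 1., 0.]])

class_counts = all_targets.sum(dim=0)              # positives per class
class_weights = 1.0 / class_counts.clamp(min=1)    # rarer class -> larger weight

# weight each sample by the rarest label it carries
sample_weights = (all_targets * class_weights).max(dim=1).values
sample_weights[sample_weights == 0] = class_weights.min()  # samples with no positive label

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
# then pass it to your DataLoader: DataLoader(dataset, batch_size=..., sampler=sampler)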

How dependent are the labels? If they are (kind of) independent, would training 5 binary classifiers be an option?

I’m not sure how to measure their dependency, but for context, I’m classifying emails based on what architectural (software) design decisions they contain. Here is the distribution of labels in my training dataset (not-ak refers to emails that don’t contain any architectural decisions):

I’m new to machine/deep learning, so I’m not sure if training 5 binary classifiers would be helpful in this case.

Well, I would indeed consider 5 binary classifiers here, i.e., one Yes/No classifier for each of the 5 class labels.

For example, if an item item1 is labeled “process, property, technology” and an item item2 is labeled “process, technology”, it can be viewed as multiclass data as follows:
(item1, process)
(item1, property)
(item1, technology)
(item2, process)
(item2, technology)

If you now train, say, the “property” classifier, you put item1 in the positive class and item2 in the negative class. For training the “process” and “technology” classifiers, both item1 and item2 are of course in the positive class.

For inference, you push a new item through all 5 classifiers. If the “property” classifier and the “technology” classifier say “yes” but the others don’t, you label the new item with “property, technology”.
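A minimal sketch of that scheme (the label names beyond those mentioned above, the 0.5 threshold, and the classifiers list are all illustrative assumptions):

import torch

labels = ["process", "property", "technology", "not-ak", "other"]  # illustrative names/order

# multi-hot targets, shape [num_samples, num_classes]
all_targets = torch.tensor([[1., 1., 1., 0., 0.],   # item1: process, property, technology
                            [1., 0., 1., 0., 0.]])  # item2: process, technology

# binary targets for the "property" classifier: item1 positive, item2 negative
property_targets = all_targets[:, labels.index("property")]

def predict_labels(classifiers, x, threshold=0.5):
    # push a new item through all 5 binary classifiers and collect the "yes" votes
    predicted = []
    for name, model in zip(labels, classifiers):
        if torch.sigmoid(model(x)).item() > threshold:
            predicted.append(name)
    return predicted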

Suppose we have all of your targets (all_targets) in a tensor of size [num_targets, num_classes].

First, sum along the num_targets dimension:

total_by_class = all_targets.sum(0)  # shape [num_classes]: positive count per class

Second, use this to generate a weights vector:

weights = total_by_class.mean() / total_by_class

Third, set the weights of empty classes (where the division by zero produced inf) back to 0:

cond = torch.isinf(weights)  # classes with zero examples produced inf in the division above
weights[cond] = 0

Finally, pass that into your loss function:

criterion = nn.CrossEntropyLoss(weight=weights)

(That form is for single-label targets; with your multi-hot targets and BCEWithLogitsLoss, the closest analogue is passing these weights via the pos_weight argument.)

What that will do is cause the loss function to scale each class’s contribution by a multiplier that is the inverse of its representation, i.e., classes with fewer examples get scaled up more.
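Putting the steps together into a self-contained sketch (dummy targets; the switch to BCEWithLogitsLoss with pos_weight for the multi-hot setup is my assumption, as noted above):

import torch
import torch.nn as nn

# dummy multi-hot targets, shape [num_targets, num_classes]
all_targets = torch.tensor([[1., 0., 1., 0., 0.],
                            [1., 0., 0., 0., 0.],
                            [1., 1., 0., 0., 0.],
                            [0., 1., 1., 0., 0.]])

total_by_class = all_targets.sum(0)                # tensor([3., 2., 2., 0., 0.])
weights = total_by_class.mean() / total_by_class   # inf where a class never occurs
weights[torch.isinf(weights)] = 0                  # zero out empty classes

# multilabel analogue of the weighted loss (my assumption, see above)
criterion = nn.BCEWithLogitsLoss(pos_weight=weights)

logits = torch.randn(4, 5)                         # stand-in model outputs
loss = criterion(logits, all_targets)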