Hello,
I have a dataset of email texts and their corresponding labels. Each email can carry multiple labels at once, so this is a multi-label classification problem. I multi-hot encoded the labels, so each target looks like [1, 0, 1, 0, 0], where a 1 means the email belongs to that class. However, the data is imbalanced: some label combinations barely occur in the dataset, which makes my LSTM biased towards the majority classes. I tried passing pos_weight to BCEWithLogitsLoss, but that doesn't seem to help. What other balancing techniques could I use?
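For reference, here is roughly how I set up the weighted loss (the label matrix below is a made-up toy example, and the logits are random placeholders for my LSTM's output):

```python
import torch
import torch.nn as nn

# Toy multi-hot label matrix: 6 emails, 5 classes (made-up data)
labels = torch.tensor([
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
], dtype=torch.float32)

# pos_weight = (# negatives) / (# positives) per class, so rare
# classes contribute more to the loss. Clamp positives to avoid
# division by zero for classes that never occur in the sample.
pos_counts = labels.sum(dim=0)
neg_counts = labels.shape[0] - pos_counts
pos_weight = neg_counts / pos_counts.clamp(min=1)

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Placeholder logits standing in for the LSTM head's output,
# one raw score per class (no sigmoid before this loss).
logits = torch.randn(6, 5)
loss = criterion(logits, labels)
```

Even with these per-class weights the model still mostly predicts the frequent classes, which is why I am asking about other techniques.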