NLP text classification with too many labels

I am using a transformer for text classification. I found that 20 labels cover about 80% of all cases. The problem is that the remaining 20% of cases are spread across hundreds or thousands of labels, each occurring at a much lower frequency than the top 20.

One method for dealing with this would be to consolidate the labels, but I do not think that is possible here. The top 20 labels have to stay exactly as they are to be useful, and arbitrarily merging the rarer samples into them would not be practically useful.

My approach has been to ignore the other 20% of cases. I trained only on the top 20 labels and get great performance. My concern is what happens when the model goes to production. My current thinking is that the good performance on the top 20 labels would be a great benefit to users, but the model might suggest odd labels for the 20% of cases I didn’t train on. I could use threshold probabilities, where the user wouldn’t get a prediction if the model wasn’t confident (rough sketch below), but I feel like this could backfire: the model might predict a top-20 label with high probability for a non-top-20 text, or it might correctly predict a top-20 label but with a probability below the threshold.
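For concreteness, the thresholding I have in mind looks roughly like this. Just a sketch: `model` and `tokenizer` stand in for my fine-tuned Hugging Face sequence-classification model and its tokenizer, and the 0.9 threshold is a placeholder I would tune on held-out data:

```python
import torch
import torch.nn.functional as F

def predict_or_abstain(text, model, tokenizer, threshold=0.9):
    """Return a top-20 label, or None if the model isn't confident enough."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = F.softmax(logits, dim=-1).squeeze(0)
    confidence, label_id = probs.max(dim=-1)
    if confidence.item() < threshold:
        return None  # abstain: the user gets no suggestion
    return model.config.id2label[label_id.item()]
```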

Does this make sense? Since I can’t realistically capture hundreds or thousands of labels that each have very few samples, maybe I could merge them into a single “other” label? But since they are so different from one another, that doesn’t seem like it would make sense either.

It seems like all of my googling leads to examples where there are only a handful of labels, or where the labels can be consolidated. What do you do when neither of those applies?

This is a classic class-imbalance problem.
What metric are you using to measure ‘performance’? Your model may have high precision but low recall.
You described the problem nicely, but I’m not exactly clear on what you are trying to achieve.

  1. A better approach could be to modify the sampling technique (oversampling the rare classes or undersampling the frequent ones) or to use a weighted loss function (a minimal sketch follows this list).
  2. Another thing you can try is hierarchical classification: use an “other” label for the long-tail classes, then use another classifier to further segregate them (see the two-stage sketch below). Of course, the more levels you have, the more data you’ll need.
  3. Sometimes I combine a rule-based classifier with a deep learning model to handle similar scenarios; the rules can serve as the second stage in the same two-stage setup.
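For point 1, a minimal weighted-loss sketch, assuming PyTorch and scikit-learn. `train_labels` here is a toy stand-in for your array of integer label ids:

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-in for the training label ids; rare classes get larger weights.
train_labels = np.array([0, 0, 0, 0, 1, 1, 2])

weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(train_labels), y=train_labels
)
# Pass the weights to the loss so mistakes on rare labels cost more.
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```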
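For points 2 and 3, the two-stage idea is just a routing wrapper. `top20_model` and `fallback` are hypothetical callables; the fallback could be a second classifier trained on the long tail or a hand-written rule-based system:

```python
def classify(text, top20_model, fallback):
    """Route: top-20 classifier first, long-tail handler for 'other'."""
    label = top20_model(text)  # returns a top-20 label or "other"
    if label != "other":
        return label
    return fallback(text)      # second-stage classifier or rule-based system
```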

My validation and test set accuracy are both 92%. I looked at F1 scores as well: the micro and weighted F1 scores are about 92%, and the macro F1 score is about 81%. The macro F1 is dragged down by below-50% accuracy on one of the less frequently occurring top-20 labels. False negatives are not a major concern in my domain.

Thank you for the suggestions. This is great food for thought.