I am using a transformer for text classification. I found that 20 labels cover about 80% of all cases. The problem is that the remaining 20 percent of cases are spread across hundreds or thousands of labels, each occurring at a much lower frequency than the top 20.
One method for dealing with this problem would be to consolidate the labels, but I do not think that is possible here. Those 20 labels have to stay exactly as they are to be useful, and arbitrarily merging other samples into them would not be practically useful.
My approach so far has been to ignore the other 20 percent of cases. I trained only on the top 20 labels and get great performance. My concern is what happens when the model goes to production. My current thinking is that the good performance on the top 20 labels would be a great benefit to users, but the model might suggest odd labels for the 20% of cases I didn’t train on. I could use a probability threshold where the user wouldn’t get a prediction if the model wasn’t confident, but I feel like this could backfire: the model might predict a top-20 label with high probability for a text that actually belongs to a non-top-20 label, or it might correctly predict a top-20 label but with a probability below the threshold.
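To be concrete about the threshold idea: what I have in mind is something like the sketch below, where the model abstains instead of returning a label when its top probability is too low. The function name and the 0.9 cutoff are placeholders I made up; the threshold would need tuning on held-out data, and since the softmax is only over the top 20 labels, it can still be overconfident on out-of-distribution text, which is exactly my worry.

```python
def predict_with_threshold(probs, labels, threshold=0.9):
    """Return a label only when the model is confident enough.

    probs: softmax probabilities over the top-20 labels.
    labels: the corresponding label names.
    The 0.9 threshold is a placeholder, not a tuned value.
    """
    # Index of the highest-probability label.
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return labels[best]
    return None  # abstain: no prediction shown to the user

# A confident prediction passes; an uncertain one abstains.
labels = ["label_a", "label_b", "label_c"]
print(predict_with_threshold([0.95, 0.03, 0.02], labels))  # label_a
print(predict_with_threshold([0.40, 0.35, 0.25], labels))  # None
```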
Does this make sense? Since I can’t realistically capture hundreds or thousands of labels that each have very few samples, maybe I could merge them all into a single “other” label? But since they are so different from each other, it seems like that wouldn’t make sense.
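For reference, the preprocessing I’m picturing for the “other” idea is roughly this: keep the top-k most frequent labels and map everything else to a single reject class. The helper name and the toy data are mine, just to illustrate; the open question is whether such a heterogeneous “other” class is learnable at all.

```python
from collections import Counter

def collapse_tail_labels(samples, top_k=20):
    """Map every label outside the top_k most frequent to 'other'.

    samples: list of (text, label) pairs. 'other' acts as an
    explicit reject class so the model can flag texts that
    don't belong to any frequent label.
    """
    counts = Counter(label for _, label in samples)
    keep = {label for label, _ in counts.most_common(top_k)}
    return [(text, label if label in keep else "other")
            for text, label in samples]

# Toy example with top_k=1: only the most frequent label survives.
data = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "z")]
print(collapse_tail_labels(data, top_k=1))
# [('a', 'x'), ('b', 'x'), ('c', 'other'), ('d', 'other')]
```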
It seems like all of my googling leads to examples where there are only a handful of labels, or labels can be consolidated. What happens when this doesn’t work?