I have a list of patient symptom texts that I want to classify as multi-label with BERT. The problem is that there are thousands of classes (labels) and they are heavily imbalanced.
An example input text: "The patient reports headache and fatigue."
Here are some approaches I am considering:
1. One-vs-rest models + datasets: stack multiple one-vs-rest BERT models, each trained on a balanced one-vs-rest dataset. The problem is that the resulting ensemble is huge with so many stacked models. Additionally, PyTorch doesn't recognize the individual models as parameters when I assign them as a plain dictionary of layers.
2. One-vs-rest datasets only: build one dataset per outcome, then feed them all to a single model with an additional input denoting which outcome to predict.
3. Class weights: I tried this in the past and it didn't seem to work well. Is it possible for data to be too imbalanced for class weights to help?
- As I understand it, SMOTE does not work for multi-label data.
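On the PyTorch issue in approach 1: a plain Python `dict` of submodules is invisible to `nn.Module`, so its parameters never reach the optimizer; `nn.ModuleDict` registers them. A minimal sketch (the encoder dimension and label names are placeholders, with simple linear heads standing in for the stacked BERT models):

```python
import torch
import torch.nn as nn

class OneVsRestEnsemble(nn.Module):
    def __init__(self, encoder_dim, labels):
        super().__init__()
        # nn.ModuleDict (unlike a plain dict) registers each head as a
        # submodule, so .parameters(), .to(device), and .state_dict() see them.
        self.heads = nn.ModuleDict({
            label: nn.Linear(encoder_dim, 1) for label in labels
        })

    def forward(self, pooled, label):
        # pooled: (batch, encoder_dim) BERT pooled output
        return self.heads[label](pooled)

model = OneVsRestEnsemble(768, ["fever", "headache"])
# Each Linear(768, 1) head contributes 768 weights + 1 bias = 769 parameters.
print(sum(p.numel() for p in model.parameters()))
```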
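One way the "additional entry denoting what outcome to predict" in approach 2 could look: embed the target label ID and concatenate it with the pooled text representation, then use a single binary head. The architecture details here are assumptions, not a fixed recipe:

```python
import torch
import torch.nn as nn

class ConditionedClassifier(nn.Module):
    """One shared binary head; the target label is an input, not a separate model."""
    def __init__(self, encoder_dim, num_labels, label_dim=32):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, label_dim)
        self.head = nn.Linear(encoder_dim + label_dim, 1)

    def forward(self, pooled, label_ids):
        # pooled: (batch, encoder_dim) text representation
        # label_ids: (batch,) which outcome each row should be scored against
        x = torch.cat([pooled, self.label_emb(label_ids)], dim=-1)
        return self.head(x).squeeze(-1)  # logit for "this label applies"

model = ConditionedClassifier(encoder_dim=768, num_labels=3000)
pooled = torch.randn(4, 768)
label_ids = torch.tensor([0, 42, 7, 2999])
print(model(pooled, label_ids).shape)  # one logit per (text, label) pair
```

This keeps model size constant in the number of labels, at the cost of one forward pass per (text, label) query at inference time.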