Text classification with over 10,000 classes and at least two text features

Hello everyone, I want to do text classification using deep learning.
My dataset has at least two columns containing text features, and there are more than 10,000 classes. Can anyone suggest a method?

I have many years of experience in text classification. 10,000 classes is a lot, both for text classification and for pattern recognition in general, so I hope you have enough samples per class for training. LSTM models are known to be successful for text classification; I can recommend these deep learning approaches.
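To make the suggestion concrete, here is a minimal sketch of an LSTM text classifier in PyTorch. It assumes the texts have already been tokenized into integer IDs; the vocabulary size, embedding dimension, and class count below are placeholder values, not recommendations:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding -> LSTM -> linear head over the final hidden state."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):          # (batch, seq_len) int64 token IDs
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        return self.fc(hidden[-1])         # (batch, num_classes) logits

# Example: a batch of 4 sequences of length 12, 10,000 output classes
model = LSTMClassifier(vocab_size=20_000, embed_dim=128,
                       hidden_dim=256, num_classes=10_000)
logits = model(torch.randint(0, 20_000, (4, 12)))
print(logits.shape)  # torch.Size([4, 10000])
```

With this many classes, the final linear layer dominates the parameter count, so the hidden size is worth keeping modest.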


Well, every Seq2Seq task where you predict an output sequence can be viewed as a classification task where the goal is to predict the next word from a vocabulary, and the vocabulary can easily be of size 10k+. But yeah, such models typically require a huge amount of data to generalize well.


Hi @akuysal, thanks for your help. I managed to find distinct subsets in the data, but these subsets are very unbalanced, and the number of classes within each subset differs from one subset to another. I want to first classify the data into subsets, and then classify the data of each subset into the classes it contains. Is it possible to apply SMOTE? How large should my dataset be for the deep learning model to train well?

I haven’t applied SMOTE (Synthetic Minority Oversampling Technique) before. As far as I understand, you are trying to perform a kind of hierarchical text classification, similar to the 20 Newsgroups dataset. In general, my aim was to propose a framework that also works on imbalanced data, and I mostly worked with traditional machine learning algorithms rather than deep learning. Besides over/under-sampling methods, ensemble classification approaches can also be applied to imbalanced data. I couldn’t exactly understand what kind of texts you have (probably they are short texts). Also, what is the accuracy of the currently running system without applying SMOTE?

Hi @akuysal, sorry for the long delay. I tried several methods, and finally used a bidirectional LSTM model with more than 5,000 output classes. I get about 49% accuracy.

Hi @Jean1, I think that accuracy is not bad for your case, because there are so many classes. You may analyze the confusion matrix to understand which classes are difficult to classify; I think it may help you improve your model parameters.
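As a sketch of that analysis, you can read per-class recall off the confusion matrix with scikit-learn and rank the classes by it; the labels below are toy data standing in for your predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions for 3 classes
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 0, 0]

cm = confusion_matrix(y_true, y_pred)   # rows = true, columns = predicted
# Diagonal / row sums = per-class recall; low values flag hard classes
per_class_recall = cm.diagonal() / cm.sum(axis=1)
hardest = np.argsort(per_class_recall)
print(per_class_recall)  # [0.667 1.    0.333]
print(hardest[0])        # class 2 is the hardest here
```

With 5,000+ classes the full matrix is too large to eyeball, but sorting classes by this recall vector quickly surfaces the worst offenders.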

Thanks @akuysal. But I have a question: how can I improve the parameters of the model?

How well is each class represented in the data? For example, if 49% of your data belongs to one class, the model could be scoring 49% just by guessing that class every time. Any under-represented or over-represented class can create a sampling bias. Ideally, you want an even number of samples in each class and subclass.
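This check takes only a few lines with the standard library; `labels` below is a hypothetical stand-in for your label column:

```python
from collections import Counter

# Hypothetical label column (1,000 rows, 4 classes)
labels = ["a"] * 490 + ["b"] * 300 + ["c"] * 150 + ["d"] * 60

counts = Counter(labels)
majority_share = max(counts.values()) / len(labels)
print(counts.most_common())
# A model that always predicts the majority class scores this accuracy:
print(f"majority-class baseline: {majority_share:.0%}")  # 49%
```

If your model's accuracy is close to this baseline, it may not have learned much beyond the class prior.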


Random notes, but you want a baseline to tell you whether you are above random or "bad". For example, if there are 4 classes in a 100-row sample, and class A has 85 rows, B 10, C 3, and D 2, you can do a bit of analysis. A random classifier will have an expected accuracy of 25%; on the other hand, a classifier that always predicts A will have 85%. But if B, C, and D are crucial, you need to compute other metrics, such as recall, precision, or AUC. Also, you should try to keep your train and test distributions almost the same.
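The 4-class example above can be computed directly with scikit-learn: the always-A classifier gets 85% accuracy, but macro-averaged recall drops to 25% because B, C, and D are never predicted:

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 rows: A=85, B=10, C=3, D=2 (as in the example above)
y_true = ["A"] * 85 + ["B"] * 10 + ["C"] * 3 + ["D"] * 2
y_pred = ["A"] * 100                 # classifier that always predicts A

print(accuracy_score(y_true, y_pred))                # 0.85
# Macro recall averages per-class recall, exposing the ignored classes
print(recall_score(y_true, y_pred, average="macro")) # 0.25
```

Macro averaging weights every class equally, which is exactly what you want when the rare classes matter.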


Hi @J_Johnson. Let us assume that the classes are almost balanced. But can such a model be deployed in production?

That would depend on how well the model performs and whether that accuracy is suitable for the task.

But my point in the comment above is that if the superficial accuracy metric is applied without a deeper understanding of the train/test data, one could be misled into thinking a model has high accuracy when it does not.

Take, for instance, a simple cat/dog classification model that gets 80% accuracy. Upon further investigation, it turns out the data is 20% cats and 80% dogs, and the model just learned to guess "dog" every time. That model wouldn't be suitable for anything.
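In numbers: plain accuracy rewards the always-"dog" model, while balanced accuracy (the mean of per-class recall, available in scikit-learn) exposes it as no better than a coin flip:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 20% cats, 80% dogs; the model guesses "dog" every time
y_true = ["cat"] * 20 + ["dog"] * 80
y_pred = ["dog"] * 100

print(accuracy_score(y_true, y_pred))           # 0.8
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- chance level
```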

At any rate, if a model is determined to be doing well on a balanced dataset, the use case will determine how well is good enough for that particular use.
