Text classification with over 10 thousand classes and at least two text features

Hello everyone, I want to do text classification using deep learning.
I have a dataset with at least two features, both of which are text columns, and more than 10 thousand classes. Can anyone suggest a method?

I have worked on text classification for many years. Ten thousand classes is too many, both for text classification and for pattern recognition in general. I hope you have enough samples in your dataset for training. LSTM models are known to work well for text classification, so I would recommend trying deep learning approaches of that kind.
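One way to handle the two text columns before feeding a sequence model such as an LSTM is to join them into a single token sequence with a separator and map tokens and labels to integer ids. This is only a minimal sketch in plain Python: the column names `title` and `body`, the `[SEP]` marker, and the toy rows are assumptions for illustration, not details of the dataset in question.

```python
from collections import Counter

# Hypothetical rows: two text columns and a class label.
rows = [
    {"title": "red running shoes", "body": "light mesh upper", "label": "footwear/running"},
    {"title": "steel water bottle", "body": "keeps drinks cold", "label": "kitchen/bottles"},
    {"title": "trail running shoes", "body": "grippy outsole", "label": "footwear/running"},
]

SEP = "[SEP]"                 # assumed separator token between the two fields
PAD, UNK = "[PAD]", "[UNK]"   # padding and unknown-word tokens

def tokenize(row):
    """Lowercase whitespace tokenization of both fields, joined by SEP."""
    return row["title"].lower().split() + [SEP] + row["body"].lower().split()

# Vocabulary: special tokens first, then tokens by descending frequency.
counts = Counter(tok for row in rows for tok in tokenize(row))
vocab = {tok: i for i, tok in enumerate([PAD, UNK] + [t for t, _ in counts.most_common()])}

# Label index: one integer id per distinct class.
label_to_id = {lab: i for i, lab in enumerate(sorted({r["label"] for r in rows}))}

def encode(row, max_len=8):
    """Map tokens to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(row)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

encoded = [(encode(r), label_to_id[r["label"]]) for r in rows]
```

The resulting id sequences and label ids are the usual input to an embedding layer followed by a recurrent model. An alternative design is to encode each column separately and concatenate the two representations before the output layer.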


Well, every Seq2Seq task where you predict an output sequence can be considered a classification task where the goal is to predict the next word from a vocabulary, and that vocabulary can easily be of size 10k+. But yeah, it typically requires a huge amount of data to generalize well.
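To make that concrete: at each decoding step the model produces one score per vocabulary entry, and training minimizes softmax cross-entropy against the true next token, just as in a classifier with that many classes. A minimal sketch in plain Python, where a random score vector stands in for a decoder's output; the vocabulary size and the target index are arbitrary assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

random.seed(0)
vocab_size = 10_000   # assumed size; one "class" per vocabulary entry
logits = [random.gauss(0.0, 1.0) for _ in range(vocab_size)]  # stand-in for decoder scores
target = 1234         # assumed index of the true next token

probs = softmax(logits)
loss = cross_entropy(logits, target)

# A model that spreads probability uniformly over the vocabulary scores log(vocab_size).
uniform_loss = math.log(vocab_size)
```

Comparing `loss` against `uniform_loss` is a quick sanity check early in training: a loss near `log(vocab_size)` means the model has not yet learned anything beyond guessing.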


Hi @akuysal, thanks for your help. I managed to find distinct subsets in the data, but these subsets are very unbalanced, and the number of classes differs from one subset to another. I want to first classify the data by subset and then classify the data within each subset into that subset's classes. Is it possible to apply SMOTE? And how much data should I expect to need for my deep learning model to train well?
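The two-stage idea described here (predict the subset first, then the class within that subset) can be sketched as follows. To keep the example self-contained it uses a tiny Naive Bayes classifier written in plain Python as a stand-in for whatever model is actually trained; the `TinyNB` class, the subset and class names, and the toy rows are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustration only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> token counts
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        n = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(count / n)            # log prior
            for tok in tokens:                     # smoothed log likelihoods
                score += math.log((self.word_counts[label][tok] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training rows: (text, subset, class). Each class belongs to one subset.
train = [
    ("red running shoes", "footwear", "running"),
    ("leather office shoes", "footwear", "formal"),
    ("trail running sneakers", "footwear", "running"),
    ("steel water bottle", "kitchen", "bottles"),
    ("ceramic coffee mug", "kitchen", "mugs"),
    ("insulated water flask", "kitchen", "bottles"),
]

# Stage 1: one router that predicts the subset.
router = TinyNB().fit([t for t, _, _ in train], [s for _, s, _ in train])

# Stage 2: one classifier per subset, trained only on that subset's rows.
by_subset = defaultdict(list)
for text, subset, cls in train:
    by_subset[subset].append((text, cls))
experts = {
    subset: TinyNB().fit([t for t, _ in rows], [c for _, c in rows])
    for subset, rows in by_subset.items()
}

def predict(text):
    """Route to a subset, then classify within it."""
    subset = router.predict(text)
    return subset, experts[subset].predict(text)
```

One property of this design worth keeping in mind: a routing mistake in stage 1 cannot be recovered in stage 2, so the router's accuracy caps the accuracy of the whole pipeline.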

I haven’t applied SMOTE (Synthetic Minority Oversampling Technique) before. As far as I understand, you are trying to perform a kind of hierarchical text classification, similar to the 20 Newsgroups dataset. In general, my aim was to propose a framework that also works on imbalanced data. I mostly worked with traditional machine learning algorithms rather than deep learning. Besides over- and under-sampling methods, ensemble classification approaches can also be applied to unbalanced data. I couldn’t quite tell what kind of texts you have (probably short texts). Also, what is the accuracy of your current system without SMOTE?
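As a baseline to compare against, here is a minimal sketch of two simpler ways to handle imbalance: random oversampling and inverse-frequency class weights. Note that this is not SMOTE. SMOTE builds synthetic minority samples by interpolating between numeric feature vectors, so it applies to vectorized text rather than raw strings, whereas random oversampling just duplicates existing samples. The function names and toy data below are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(items) for items in by_label.values())
    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Hypothetical imbalanced toy data: four "common" samples, two "rare" ones.
texts = ["a", "b", "c", "d", "e", "f"]
labels = ["common", "common", "common", "common", "rare", "rare"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
weights = inverse_frequency_weights(labels)
```

Oversampling should be applied to the training split only, after the train/validation split, so that duplicated samples do not leak into evaluation. The class weights are an alternative that leaves the data untouched and instead scales each class's contribution to the loss.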