Big difference between Train and Test ACC

Hi, forget about the title.

I'm doing sentiment analysis on comments. I labeled the text data by hand, one by one, then trained a model and got an 87% weighted F1 score on the train set. I checked it on data the model hadn't seen and it did well.

But when I upload it to a contest website, it gets a 53% weighted F1 score.

I don't know how they label their data, but according to my own labeling, the model works perfectly well on unseen data.

Can there be any reason other than incorrect labeling why the model did not work well on the contest's data?

How many classes are there? If it is 2, the definition of F1 is standard. But for more than 2 classes, there are different versions of F1 (e.g. micro vs. macro vs. weighted).
This is something to check.
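To make the point above concrete, here is a small sketch showing how micro, macro, and weighted F1 can all differ on the same 3-class predictions. The labels and predictions below are made up for illustration:

```python
# Micro, macro, and weighted F1 on the same 3-class predictions
# (toy labels/predictions, chosen so the three averages differ).
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 1]

f1_micro = f1_score(y_true, y_pred, average="micro")        # global TP/FP/FN
f1_macro = f1_score(y_true, y_pred, average="macro")        # unweighted class mean
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean

print(f"micro:    {f1_micro:.3f}")
print(f"macro:    {f1_macro:.3f}")
print(f"weighted: {f1_weighted:.3f}")
```

If the contest scores with `average="weighted"` and you validated with a different average, the numbers won't be comparable.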

Another possible reason is the data distribution. If there is a huge difference between your training data and the website's data (e.g. in the ground-truth class proportions), that could also explain the gap.

If the model is small and the training data is small (and if you have the time and resources), I would suggest doing cross-validation and ensembling, just to reduce the chance of overfitting.
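As a sketch of what that suggestion could look like in scikit-learn: score the model per fold (a large spread across folds is itself a hint of instability), then fit one model per fold and majority-vote at prediction time. The data here is synthetic (`make_classification`), and `LinearSVC` is just one reasonable choice:

```python
# K-fold cross-validation plus a simple fold-ensemble (majority vote).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

# Per-fold validation scores: a large spread is a warning sign.
scores = cross_val_score(LinearSVC(C=1.0, max_iter=5000), X, y,
                         cv=StratifiedKFold(n_splits=5, shuffle=True,
                                            random_state=0),
                         scoring="f1_weighted")
print("fold F1s:", scores.round(3), "mean:", scores.mean().round(3))

# Ensemble: one model per fold, majority vote at prediction time.
models = []
for train_idx, _ in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
    models.append(LinearSVC(C=1.0, max_iter=5000).fit(X[train_idx], y[train_idx]))

def vote(X_new):
    preds = np.stack([m.predict(X_new) for m in models])  # shape (5, n_samples)
    # Majority vote down each column
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

print("ensemble predictions on first 5 rows:", vote(X[:5]))
```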

  • There are 3 classes, and the macro-averaged F1 is 84%, but they only ask for the weighted F1.

  • I have website data which I labeled by hand. The problem is, I don't know how they label their data. Where I have a neutral class, they may have negative or positive, etc. That's where I see the main problem.

  • Okay, thanks. I'll look into cross-validation.

Also, do you have access to the test input text data? Maybe you could do some NLP analysis and compare it with your text data, without requiring ground-truth labels.

One starting point would be here: Exploratory Data Analysis for Natural Language Processing: A Complete Guide to Python Tools
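A minimal, label-free version of that comparison could look like this: compare average text length and vocabulary overlap between your training texts and the contest's test texts. The two tiny corpora below are placeholders for your actual data:

```python
# Label-free comparison of two text datasets: length profile and
# vocabulary (Jaccard) overlap. Corpora here are toy placeholders.
from collections import Counter

train_texts = ["the movie was great", "terrible acting overall",
               "great plot and great cast"]
test_texts = ["service was terrible", "great product", "awful experience"]

def stats(texts):
    tokens = [tok for t in texts for tok in t.lower().split()]
    return Counter(tokens), sum(len(t.split()) for t in texts) / len(texts)

train_vocab, train_avg_len = stats(train_texts)
test_vocab, test_avg_len = stats(test_texts)

overlap = set(train_vocab) & set(test_vocab)
jaccard = len(overlap) / len(set(train_vocab) | set(test_vocab))

print(f"avg length: train={train_avg_len:.1f}, test={test_avg_len:.1f}")
print(f"vocabulary Jaccard overlap: {jaccard:.2f}")
```

A low vocabulary overlap or a very different length profile suggests a domain shift between your data and the contest's, even without seeing their labels.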

Another direction (which you've probably already tried) is to start with a pretrained large language model, which has a stronger understanding of the underlying text.
Then you could vary the amount of regularization and see how it changes the (web) test accuracy. This might give you an idea about the discrepancy between the way the web data is labeled and your own labeling method.
For instance, you could increase the L2 regularization and try each of the resulting models on the web data.
Obviously, this is not the best way to measure overfitting, but it can help detect a discrepancy between the website's labeling method and yours.
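Here is one way the sweep just described could be sketched with scikit-learn. Note that in `LinearSVC`, *smaller* `C` means *stronger* L2 regularization. On the contest data you would watch how the test score moves across the sweep; synthetic data stands in for it here:

```python
# Regularization sweep: train LinearSVC at several C values and compare
# train vs. held-out weighted F1 at each strength.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

gaps = {}
for C in (0.01, 0.1, 1.0, 10.0):  # small C = strong L2 penalty
    clf = LinearSVC(C=C, max_iter=10000).fit(X_tr, y_tr)
    f1_tr = f1_score(y_tr, clf.predict(X_tr), average="weighted")
    f1_te = f1_score(y_te, clf.predict(X_te), average="weighted")
    gaps[C] = f1_tr - f1_te
    print(f"C={C:<5} train F1={f1_tr:.3f}  test F1={f1_te:.3f}  gap={gaps[C]:+.3f}")
```

If the train-test gap stays large even under strong regularization (small `C`), the problem is more likely a label/distribution mismatch than plain overfitting.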

The problem is that my data is not in English, so there are no pretrained models in my language. I'm doing the task with traditional ML (either SVM or Random Forest).

Yes. And once I label the website's data with my model, it doesn't look bad compared to how I labeled my training data (I checked around 50 samples). But when I upload it to the website, the F1 is very low.

I don't know. Now I'm going to try this task using BERT, but I'm not sure the free Google Colab would be enough to train BERT from scratch.

I have seen transformer-based models in many languages (up to 70-90) on TensorFlow Hub and Hugging Face; maybe you should check those again.

There are also multilingual models on these platforms that perform somewhat decently for low-resource languages.

Also, training a model as large as BERT (even a small version) from scratch on a small dataset (e.g. 10,000 sentences; of course there are other factors such as data quality, but this is just a first-order estimate) would likely lead to overfitting on your training set.

There is no pretrained BERT for Georgian, nor any pretrained transformer model in Georgian.

What do you think I should do? There are 2 possible reasons why my model did not work on the website's test data: 1) I labeled the data incorrectly, or 2) the SVM model overfits. Linear SVM mostly tends to overfit here, but the other kernel types have very low train F1.

Actually, the SVM whose F1 was 87% on the train set had 34% on the website's test set, while with the Random Forest algorithm I had 83% on train and 53% on test.

First, I would go as far as possible with traditional ML approaches, since they are easier to implement.
For instance, hyperparameter tuning for SVM (using random search). In my experience with real data, SVM with cross-validation and ensembling over the cross-validated models tends to perform better than Random Forest, and is also faster to train (at an equal range of complexities). The result you obtained might be highly dependent on the hyperparameter tuning. I would also try boosted trees in addition to RF and SVMs.

Since I don't have access to the data, I am not sure about the data preprocessing.
Some general advice:

  • Standardize your text (e.g. lowercase, get rid of unnecessary punctuation, expand contractions: you're → you are)
  • Get rid of unnecessary entities to make your text less noisy (e.g. hashtags in the context of tweets); this is obviously highly context dependent
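The two bullets above could be sketched as a single cleaning function; the contraction map and regex patterns here are illustrative, not exhaustive:

```python
# Minimal text standardization: lowercase, expand a few contractions,
# strip URLs/mentions/hashtags, drop leftover punctuation.
import re

CONTRACTIONS = {"you're": "you are", "don't": "do not", "it's": "it is"}

def standardize(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # mentions / hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # remaining punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(standardize("You're GREAT!!! #happy @user http://x.co/abc"))
# -> "you are great"
```

For Georgian you would swap in language-appropriate normalization rules; the structure stays the same.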

If you are not satisfied with these, and if you have a 1-2 month time period, quality unlabeled Georgian text data, and an appetite for an NLP pretraining adventure, you could train a BERT-like model from scratch. Under those conditions, I think it would be a good learning opportunity; I also noticed there was only one project on the Georgian language, and it was far from complete.

Hi again, Sorry for the late reply.

I don't have much time or the resources (a GPU) to train BERT from scratch.

Also, I'm new to ML. I know deep learning, but I have never trained any traditional ML algorithm, so cross-validation and the like are new to me. I hope I'll figure it out in the next 10 days.

Thank you again. I'm going to do the task this way, with SVM and cross-validation.

Using cross-validation will probably help if you have a small amount of data. However, don't expect a very big improvement in accuracy; it mainly gives you a proper evaluation when your data distribution is not balanced. I think you should investigate per-class accuracies on the test set, i.e. look at confusion matrices. At least then you can understand which classes are causing the low accuracy. Then try to optimize the parameters of the learning algorithm. For example, in linear SVM, you can change the cost parameter (generally called C).
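A sketch of that per-class diagnosis: a confusion matrix plus a per-class report, so you can see which class drags the weighted score down. The labels and predictions below are made up for illustration:

```python
# Confusion matrix and per-class metrics for a 3-class problem.
from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1, 2, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = true class, columns = predicted class
print(classification_report(y_true, y_pred, digits=3))
```

In this toy example, the off-diagonal entries of row 2 show that class 2 is frequently confused with class 1; that is exactly the kind of pattern to look for (e.g. your neutral class leaking into positive/negative).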

I did that, thank you so much. Now I'm trying to avoid overfitting with either LDA or PCA.

Of course, you can apply feature transformation (LDA/PCA) or feature selection as an additional step. It may improve accuracy and/or reduce the feature dimensionality. However, as far as I know, SVMs in particular can handle huge feature sizes, so adding LDA/PCA will probably yield more improvement for other classifiers than for SVMs.
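One way to wire that transformation step into a text pipeline: TF-IDF features reduced with `TruncatedSVD` (the usual PCA-like choice for sparse text matrices, since plain PCA requires dense input), feeding a `LinearSVC`. The tiny corpus and component count are illustrative:

```python
# TF-IDF -> TruncatedSVD (dimensionality reduction) -> LinearSVC pipeline.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great film", "bad film", "great acting", "bad plot",
         "great story", "awful film", "nice plot", "awful acting"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=4, random_state=0),  # reduce feature dimension
    LinearSVC(C=1.0, max_iter=10000),
)
pipe.fit(texts, labels)
print("train accuracy:", pipe.score(texts, labels))
```

Putting the reduction inside a `Pipeline` matters for cross-validation: the SVD is then refit on each training fold, so no information leaks from the validation fold.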

Thanks. What do you think I should do to avoid overfitting in SVM? I did a hyperparameter search and the best estimator overfitted. I did data augmentation; I have 20k labeled samples. What else should I do to avoid overfitting?