PCA reduces accuracy drastically

Tornike · January 6, 2023, 8:02pm

Hi, this might not be a PyTorch question, but it is about machine learning.

I am doing sentiment analysis with a traditional ML algorithm (either SVM or RF). Without PCA, I either overfit or get very low accuracy.

To avoid overfitting, I tried applying PCA (via TruncatedSVD) to the text data. This is the code:

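from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# df, y (NumPy array of labels) and skf (the cross-validation splitter) are defined earlier
vectorizer = CountVectorizer(token_pattern=r'[^\s]+', ngram_range=(1, 2))
vectors = vectorizer.fit_transform(df["Comments"])
svd = TruncatedSVD(n_components=10, random_state=42)
data = svd.fit_transform(vectors)  # fitted on all rows, before the split

train_scores, test_scores = [], []
for train_index, test_index in skf.split(data, y):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = y[train_index], y[test_index]

    text_classifier = SVC(C=100, decision_function_shape='ovo', gamma=0.01)
    text_classifier.fit(X_train, y_train)
    # Accuracy on the fold's training rows
    train_yhat = text_classifier.predict(X_train)
    train_scores.append(accuracy_score(y_train, train_yhat))
    # Accuracy on the fold's held-out rows
    test_yhat = text_classifier.predict(X_test)
    test_scores.append(accuracy_score(y_test, test_yhat))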

In this case, the model has low accuracy. Without PCA, it overfits.

What should I do?

Tornike · January 6, 2023, 9:04pm

Or is there any other way to avoid overfitting?

carbocation · January 7, 2023, 3:24am

If you are running PCA on all of your data, you may be (inappropriately) contaminating the training data with information from the validation data. That would be consistent with the observations you have described. (To be clear, I mean with respect to your statement “Without PCA i either overfit or have a very low ACC.” That is the opposite of the title of the post, by the way.)
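One way to rule this out is to put the vectorizer and the SVD inside a scikit-learn pipeline, so that both are refitted on the training rows of each fold only. Below is a minimal sketch of that idea, not your actual setup: the toy `texts` and `labels`, `n_components=5`, and `n_splits=3` are placeholders I made up so the snippet runs on its own.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for df["Comments"] and y.
texts = [
    "great product , works well", "really love it", "excellent quality and fast",
    "very happy with this", "good value , recommended", "works great , love it",
    "terrible , broke quickly", "really hate it", "poor quality and slow",
    "very unhappy with this", "bad value , avoid", "broke fast , hate it",
]
labels = [1] * 6 + [0] * 6

# Every step is refitted on the training part of each fold,
# so nothing learned from the held-out rows reaches the model.
pipeline = make_pipeline(
    CountVectorizer(token_pattern=r"[^\s]+", ngram_range=(1, 2)),
    TruncatedSVD(n_components=5, random_state=42),  # placeholder size for the toy data
    SVC(C=100, gamma=0.01),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
results = cross_validate(
    pipeline, texts, labels, cv=cv, scoring="accuracy", return_train_score=True
)

print("train accuracy per fold:", results["train_score"])
print("test accuracy per fold: ", results["test_score"])
```

Comparing the per-fold train and test accuracy from this version with the numbers from the original loop should show whether fitting the SVD on all rows was changing the picture.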

Tornike · January 7, 2023, 10:20am

To be clearer: PCA reduces accuracy in the cases where the model overfits without PCA.

I am trying different hyperparameter settings. Some give very low accuracy; one or two overfit.

When it overfits, I apply PCA, and accuracy drops drastically.
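A quick check that may help explain the drop is to look at how much of the variance in the count matrix the kept components actually retain. The sketch below is only illustrative: the toy `texts` and the candidate sizes in the loop are made-up placeholders, and on a real unigram-plus-bigram vocabulary 10 components may well keep only a small fraction.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for df["Comments"].
texts = [
    "great product , works well", "really love it", "excellent quality and fast",
    "very happy with this", "good value , recommended", "works great , love it",
    "terrible , broke quickly", "really hate it", "poor quality and slow",
    "very unhappy with this", "bad value , avoid", "broke fast , hate it",
]

counts = CountVectorizer(token_pattern=r"[^\s]+", ngram_range=(1, 2)).fit_transform(texts)

retained = {}
for k in (2, 5, 8):  # placeholder sizes; real data would use larger values
    svd = TruncatedSVD(n_components=k, random_state=42).fit(counts)
    # Share of the total variance kept by the first k components
    retained[k] = float(svd.explained_variance_ratio_.sum())
    print(f"n_components={k}: retained variance ratio = {retained[k]:.3f}")
```

If the ratio is small at the chosen size, the classifier is seeing very little of the original signal, which would be consistent with low accuracy on both train and test.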

Now I am trying Linear Discriminant Analysis (LDA), another dimensionality-reduction method, but I could not work out how to prepare the data for it. LDA of course cannot read strings; it needs numeric input. But when I vectorize the data and then pass it to LDA, Colab crashes from running out of memory.
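A likely cause of the crash is that scikit-learn's LinearDiscriminantAnalysis needs a dense array, and densifying a unigram-plus-bigram count matrix can need far more memory than Colab offers. One possible workaround, sketched below under made-up assumptions (toy `texts` and `labels`, `n_components=5` for the SVD), is to shrink the sparse matrix with TruncatedSVD first and feed its small dense output to LDA. Note that LDA can produce at most `n_classes - 1` components, so a two-class problem yields a single feature.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for df["Comments"] and y.
texts = [
    "great product , works well", "really love it", "excellent quality and fast",
    "very happy with this", "good value , recommended", "works great , love it",
    "terrible , broke quickly", "really hate it", "poor quality and slow",
    "very unhappy with this", "bad value , avoid", "broke fast , hate it",
]
labels = [1] * 6 + [0] * 6

pipeline = make_pipeline(
    CountVectorizer(token_pattern=r"[^\s]+", ngram_range=(1, 2)),  # sparse counts
    TruncatedSVD(n_components=5, random_state=42),   # sparse in, small dense out
    LinearDiscriminantAnalysis(n_components=1),      # at most n_classes - 1 components
    SVC(C=100, gamma=0.01),
)
pipeline.fit(texts, labels)

# Output of every step except the final classifier
reduced = pipeline[:-1].transform(texts)
print("shape after SVD + LDA:", reduced.shape)
```

Because the whole chain is a single pipeline, it can also be passed to a cross-validation helper so that each step is fitted on the training folds only.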