Preprocessing using StandardScaler and PCA

piaoliusihai · April 16, 2018, 4:43am

I need to split my data into a training set, a dev set, and a test set.
Then I need to preprocess it with StandardScaler() and PCA.
Here are my questions:

Should I apply PCA before StandardScaler(), and before splitting my data into training, dev, and test sets?
Should I fit StandardScaler() on my training set and then transform my dev and test sets? Or should I fit it on my training and dev sets together and then transform my test set?

Thanks a lot for your reply.

ptrblck · April 16, 2018, 7:44am

You should scale your features before applying PCA. PCA looks for the components that maximize variance, so features measured on very different scales can mislead it and yield “wrong” components (e.g. distance in kilometers vs. humidity in %): the feature with the largest numeric range dominates regardless of how informative it is. You also shouldn’t fit any learned transform (the scaler or PCA) on the full dataset, so split first and fit on the training set only.
The first approach is right: fit the scaler on the training set, then use it to transform the dev and test sets. The dev error is an approximation of the final test error, so both sets should be transformed identically. During training you can estimate the error on the dev set and, for example, perform early stopping based on it. The test error is only calculated once your model is finished.
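The order described above (split, then scale, then PCA, with everything fitted on the training set only) could be sketched with scikit-learn roughly as follows. This is only a minimal sketch: the synthetic data from make_classification, the 60/20/20 split, and n_components=5 are placeholder assumptions, not values from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; replace with your own X, y).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 1) Split first: 60% train, 20% dev, 20% test (split sizes are an assumption).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

# 2) Fit the scaler and PCA on the training set only.
scaler = StandardScaler()
pca = PCA(n_components=5)  # number of components is an arbitrary choice here
X_train_p = pca.fit_transform(scaler.fit_transform(X_train))

# 3) Reuse the fitted objects to transform dev and test identically.
X_dev_p = pca.transform(scaler.transform(X_dev))
X_test_p = pca.transform(scaler.transform(X_test))

print(X_train_p.shape, X_dev_p.shape, X_test_p.shape)
```

Note that the dev and test sets only ever go through transform(), never fit(), so no statistics from them leak into the preprocessing. Wrapping the scaler and PCA in sklearn.pipeline.make_pipeline would give the same behavior with less bookkeeping.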

piaoliusihai · May 19, 2018, 8:56am

Thanks so much!
I appreciate your reply!