New to PyTorch and deep learning, I have four questions:
1- Is it better to scale all your data and then split into training and test sets, or the other way around?
2- How do I use data augmentation effectively?
3- I have missing values (NaN) in my datasets. I just removed the rows where they appeared (8 out of 9230), so in my opinion they don't have much impact, or do they?
4- I'm doing classification using different types of neural nets. I have 12 classes in total, but some are much rarer than others. Is there a way to make the model more robust to rare classes?
And thank you.
I assume scaling is done for normalization. If so, you should split your data first and then scale it. Otherwise you will leak test-set information into your training procedure.
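A minimal sketch of that split-then-scale order, assuming scikit-learn is available (the array shapes here are just placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical feature matrix and labels
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, size=100)

# split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# fit the scaler on the training split only, then reuse its
# statistics (mean/std) to transform the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # no fit here -> no leakage
```

The key point is that `fit_transform` is called only on the training data; the test set is transformed with the training statistics.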
Have a look at the Data loading tutorial. Basically you create a `Dataset` and add your data augmentation there (e.g. via transforms applied in `__getitem__`). The `DataLoader` can then use multi-processing to speed up the loading and transformations.
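A small sketch of that pattern; the dataset shapes and the additive-noise transform are hypothetical stand-ins for your own data and augmentation:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        if self.transform is not None:
            # augmentation runs lazily, per sample, inside the workers
            x = self.transform(x)
        return x, self.targets[idx]

# hypothetical image-like tensors and labels
data = torch.randn(100, 3, 24, 24)
targets = torch.randint(0, 12, (100,))

# toy augmentation: add a bit of Gaussian noise
dataset = MyDataset(data, targets,
                    transform=lambda x: x + 0.01 * torch.randn_like(x))

# increase num_workers to parallelize loading + transforms
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)
```

Since the transform is applied in `__getitem__`, each epoch sees freshly augmented samples.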
It really depends on the data. Since you are only removing 8 out of ~9k samples, it doesn't seem that bad. Alternatively, try to fill the `NaN`s with the mean or median of the feature computed from your training set.
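A quick sketch of that imputation with pandas (the column names and values are made up); note the statistics come from the training frame only, consistent with the split-then-scale advice above:

```python
import numpy as np
import pandas as pd

# hypothetical feature tables with a few NaNs
train_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
test_df = pd.DataFrame({"a": [np.nan, 2.0], "b": [6.0, np.nan]})

# compute the imputation statistics on the training set only
fill_values = train_df.mean()  # use train_df.median() for the median

train_df = train_df.fillna(fill_values)
test_df = test_df.fillna(fill_values)  # reuse training means, no leakage
```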
You can use the `weight` argument for different loss functions (see here). Alternatively, you can over- or undersample your classes using `WeightedRandomSampler`. I've created a small tutorial on this topic, which is not released yet, but might give you a good start.
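Both options can be sketched roughly like this for your 12-class case; the dataset and class distribution here are synthetic placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

num_classes = 12
targets = torch.randint(0, num_classes, (1000,))

# per-class counts; clamp to avoid division by zero for empty classes
class_counts = torch.bincount(targets, minlength=num_classes).clamp(min=1).float()
class_weights = 1.0 / class_counts  # rare classes get larger weights

# Option 1: weight the loss per class
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# Option 2: oversample rare classes via a per-SAMPLE weight
sample_weights = class_weights[targets]
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)

dataset = TensorDataset(torch.randn(1000, 10), targets)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```

Note that `WeightedRandomSampler` expects one weight per sample (not per class), and that you would typically use only one of the two options at a time.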
Thank you for your help.