Data Scaling, Data Augmentation, Data Interpolation

Hello,
I'm new to PyTorch and deep learning, and I have four questions:
1- Is it better to scale all the data and then split it into training and test sets, or to split first and then scale?
2- How can I use data augmentation effectively?
3- I have missing values (NaN) in my dataset. I just removed the rows where they appeared (8 out of 9230), so in my opinion that shouldn't have much impact. Or does it?
4- I'm doing classification with different types of neural nets. I have 12 classes in total, but some are very rare compared to the others. Is there a way to make the model more robust to rare classes?
Thank you.

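  1. I assume scaling is done for normalization. If so, you should split your data first, compute the scaling statistics (e.g. mean and standard deviation) on the training set only, and apply them to both splits. Otherwise you will leak test information into your training procedure.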

  2. Have a look at the Data loading tutorial. Basically you create a Dataset and add your data augmentation in the __getitem__ function. The DataLoader can use multi-processing to speed up the loading and transformations.

  3. It really depends on the data. Since you only removed 8 out of roughly 9k rows, it doesn't seem that bad. Alternatively, try filling the NaNs with the mean or median of the corresponding feature, computed on your training set.

  4. You can use the weight argument for different loss functions (see here). Alternatively, you can over- or undersample your classes using WeightedRandomSampler. I’ve created a small tutorial on this topic, which is not released yet, but might give you a good start.
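
To make points 1 and 3 a bit more concrete, here is a minimal sketch. The data is synthetic, and the 80/20 split, the choice of the median, and the helper name fill_nans are assumptions for illustration only. The key idea is that every statistic comes from the training split.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the real table: 100 rows, 4 features, two NaNs.
X = torch.randn(100, 4) * 5.0 + 10.0
X[3, 1] = float("nan")
X[57, 2] = float("nan")

# Split first, so no test information reaches the statistics below.
perm = torch.randperm(X.size(0))
n_train = int(0.8 * X.size(0))
X_train, X_test = X[perm[:n_train]], X[perm[n_train:]]

# Per-feature median of the training split only, ignoring NaNs.
train_median = torch.nanmedian(X_train, dim=0).values


def fill_nans(x, fill_values):
    # Replace each NaN with the fill value of its column.
    return torch.where(torch.isnan(x), fill_values.expand_as(x), x)


X_train = fill_nans(X_train, train_median)
X_test = fill_nans(X_test, train_median)

# Standardize both splits with the training mean and std.
mean = X_train.mean(dim=0)
std = X_train.std(dim=0).clamp_min(1e-8)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std
```

After this, the training features have roughly zero mean and unit variance, while the test features are only approximately standardized, which is expected.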

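For point 2, a rough sketch of a Dataset that augments in __getitem__ could look like the following. The class name NoisyTabularDataset and the Gaussian-noise augmentation are hypothetical; which augmentation makes sense depends entirely on your data, and for images you would typically use transforms instead.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class NoisyTabularDataset(Dataset):
    """Hypothetical dataset that adds Gaussian noise to each sample on the fly."""

    def __init__(self, features, labels, noise_std=0.05, train=True):
        self.features = features
        self.labels = labels
        self.noise_std = noise_std
        self.train = train

    def __len__(self):
        return self.features.size(0)

    def __getitem__(self, index):
        x = self.features[index]
        if self.train:
            # Applied per sample, so each epoch sees a slightly
            # different version of the same row.
            x = x + self.noise_std * torch.randn_like(x)
        return x, self.labels[index]


# Synthetic data: 32 samples, 4 features, 12 classes.
features = torch.randn(32, 4)
labels = torch.randint(0, 12, (32,))
train_ds = NoisyTabularDataset(features, labels, train=True)
test_ds = NoisyTabularDataset(features, labels, train=False)

# num_workers > 0 would run __getitem__ in worker processes;
# 0 keeps the sketch simple.
loader = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=0)
xb, yb = next(iter(loader))
```

Note that the augmentation is switched off for the test dataset, so evaluation always sees the original samples.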

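For point 4, here is a small sketch of both options on synthetic labels with 3 classes instead of 12. The inverse-frequency weighting is just one common heuristic, not the only choice, and the random logits are a placeholder for your model's outputs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Synthetic imbalanced labels: class 0 is common, class 2 is rare.
labels = torch.cat(
    [torch.zeros(90), torch.ones(8), torch.full((2,), 2.0)]
).long()
features = torch.randn(labels.size(0), 4)

class_counts = torch.bincount(labels, minlength=3).float()

# Option A: weight the loss so that errors on rare classes cost more.
class_weights = class_counts.sum() / (class_counts.numel() * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)
logits = torch.randn(labels.size(0), 3)  # placeholder for model outputs
loss = criterion(logits, labels)

# Option B: oversample rare classes by drawing each sample with a
# weight inversely proportional to its class frequency.
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(sample_weights), replacement=True
)
loader = DataLoader(
    TensorDataset(features, labels), batch_size=20, sampler=sampler
)
sampled_labels = torch.cat([y for _, y in loader])
```

With the sampler, the batches are roughly class-balanced on average, at the cost of repeating the few rare samples many times. You would usually pick one of the two options rather than combine them.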
Thank you for your help.