Hello,
I'm new to PyTorch and deep learning, and I have four questions:
1- Is it better to scale all of the data and then split it into training and test sets, or to split first and then scale?
2- How can I use data augmentation effectively?
3- I have missing values (NaN) in my dataset. I just removed the rows where they appeared (8 out of 9230), so in my opinion that shouldn't have much impact. Or does it?
4- I'm doing classification with different types of neural nets. I have 12 classes in total, but some are very rare compared to the others. Is there a way to make the model more robust to rare classes?
Thank you.
1. I assume scaling is done for normalization. If so, you should split your data first and then scale it. Otherwise you will leak test information into your training procedure.
2. Have a look at the Data loading tutorial. Basically you create a `Dataset` and add your data augmentation in the `__getitem__` function. The `DataLoader` can use multi-processing to speed up the loading and transformations.
3. It really depends on the data. Since you only removed 8 out of 9k rows, it does not seem that bad. Alternatively, try to fill the `NaN`s with the mean or median of the feature from your training set.
4. You can use the `weight` argument for different loss functions (see here). Alternatively, you can over- or undersample your classes using `WeightedRandomSampler`. I've created a small tutorial on this topic, which is not released yet, but might give you a good start.
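To make the split-then-scale and NaN-filling advice concrete, here is a minimal sketch. The data is synthetic and the helper names `fit_stats` and `transform` are made up for illustration. The only point is that the median, mean and standard deviation come from the training split alone and are then reused unchanged on the test split.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the real data (assumption): 100 rows, 3 features.
X = torch.randn(100, 3) * 5.0 + 10.0
X[3, 1] = float("nan")   # inject a couple of missing values
X[57, 2] = float("nan")

# 1) Split first, so the test rows never influence any statistic.
perm = torch.randperm(X.size(0))
X_train, X_test = X[perm[:80]], X[perm[80:]]


def fit_stats(x_train):
    """Compute imputation and scaling statistics from the training split only."""
    median = torch.nanmedian(x_train, dim=0).values  # per-feature median, ignoring NaNs
    filled = torch.where(torch.isnan(x_train), median, x_train)
    return median, filled.mean(dim=0), filled.std(dim=0)


def transform(x, median, mean, std):
    """Fill NaNs with the training median, then standardize with training mean/std."""
    filled = torch.where(torch.isnan(x), median, x)
    return (filled - mean) / std


# 2) Fit on train, 3) apply the same statistics to both splits.
median, mean, std = fit_stats(X_train)
X_train_scaled = transform(X_train, median, mean, std)
X_test_scaled = transform(X_test, median, mean, std)
```

With only 8 incomplete rows out of 9230, dropping them and imputing them will probably give very similar results, so treat the imputation part as optional.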
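For the augmentation part, the pattern of putting the transform inside `__getitem__` could look roughly like this. The class name `AugmentedTensorDataset` and the `add_gaussian_noise` transform are hypothetical. Additive noise is just a placeholder, since the right augmentation depends on what your data actually is (images, signals, tabular features, ...).

```python
import torch
from torch.utils.data import Dataset, DataLoader


def add_gaussian_noise(x, std=0.05):
    """Placeholder augmentation (assumption): add small random noise to a sample."""
    return x + torch.randn_like(x) * std


class AugmentedTensorDataset(Dataset):
    """Wraps feature/target tensors and applies an optional transform per sample."""

    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return self.data.size(0)

    def __getitem__(self, index):
        x = self.data[index]
        y = self.targets[index]
        if self.transform is not None:
            # Applied on every access, so each epoch sees a different variant.
            x = self.transform(x)
        return x, y


# Synthetic data (assumption): 32 samples, 4 features, 12 classes.
data = torch.randn(32, 4)
targets = torch.randint(0, 12, (32,))

dataset = AugmentedTensorDataset(data, targets, transform=add_gaussian_noise)
# num_workers=0 keeps the sketch portable; raise it to load in parallel worker processes.
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

batch_x, batch_y = next(iter(loader))
```

You would normally pass the transform only to the training dataset and leave the validation and test datasets with `transform=None`.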
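And for the rare classes, here is a small sketch of both options: a weighted loss and a `WeightedRandomSampler`. The class sizes are made up, and inverse class frequency is only one common way to pick the weights, not the only one.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

num_classes = 12
# Synthetic, imbalanced labels (assumption): class 0 is common, class 11 is rare.
class_sizes = torch.tensor([500, 300, 200, 100, 50, 40, 30, 20, 10, 10, 5, 5])
targets = torch.repeat_interleave(torch.arange(num_classes), class_sizes)
features = torch.randn(targets.size(0), 8)

# Inverse-frequency weight per class, computed from the training labels.
class_counts = torch.bincount(targets, minlength=num_classes).float()
class_weights = 1.0 / class_counts

# Option A: weight the loss so mistakes on rare classes cost more.
criterion = nn.CrossEntropyLoss(weight=class_weights)
logits = torch.randn(16, num_classes)          # stand-in for model outputs
loss = criterion(logits, targets[:16])

# Option B: oversample rare classes by giving each sample its class weight.
sample_weights = class_weights[targets]
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
loader = DataLoader(TensorDataset(features, targets), batch_size=64, sampler=sampler)

# Count how often each class is drawn in one pass over the loader.
drawn = torch.cat([y for _, y in loader])
drawn_counts = torch.bincount(drawn, minlength=num_classes)
```

In one pass the drawn classes should come out roughly balanced. I would start with one of the two options rather than both, since together they compensate for the imbalance twice.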
Thank you for your help.