Using sklearn preprocessing in utils.data.dataset

barthelemymp · June 18, 2019, 4:08pm

Hello,

I’d like to add some pre-processing to my data. My data is just a list of features. For example, I would like to do a polynomial expansion; sklearn provides PolynomialFeatures for this.
However, I’m wondering which way would be best to apply it.

Either in the __init__ of my dataset:

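import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from torch.utils import data

class MyDataset(data.Dataset):

    def __init__(self, csv_file, transform=None):
        self.data = pd.read_csv(csv_file)
        ...
        # Feature columns: everything except the first and last column
        self.Xs = self.data.iloc[:, 1:-1]
        # Expand the whole feature matrix once, up front
        poly = PolynomialFeatures(2)
        self.Xs = poly.fit_transform(self.Xs)

        # The last column holds the labels
        self.labels = self.data.iloc[:, -1]
        self.transform = transform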

Or as a transform in my __getitem__, which means I need to define a transform:

class Polynomial_exp(object):

    def __init__(self, degree):
        self.degree = degree
        # Use the requested degree rather than a hard-coded 2
        self.poly = PolynomialFeatures(degree)

    def __call__(self, sample):
        # Unpack the incoming sample
        X, Y = sample['X'], sample['label']
        # ... (convert to numpy, reshape to a 2-dim matrix)
        X = self.poly.fit_transform(X)

        return {'X': X, 'label': Y}

Indeed, the sklearn function can work on the entire dataset, whereas in the second option I’m calling it for each sample. However, the same remark applies to any transform we add to a custom dataset. So I’m wondering what the point of each approach is and when we should use each solution.

Regards

Barthélémy

Nikronic · June 18, 2019, 7:53pm

Hi,

As far as I know, based on models I have implemented and source code I have read on GitHub, etc., developers usually put their preprocessing step in a custom-defined class and call it in the __getitem__ method.

Actually, I have implemented two different classes for different tasks, one using native Python and the other using third-party libraries, and both work fine.

So I would go with the second method.
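
For comparison, here is a minimal sketch of the two options from the question side by side. It is only an illustration under some assumptions: the class names PrecomputedDataset, LazyDataset and PolynomialExp are hypothetical, the data is random rather than read from a CSV file, and the fallback to a plain object is only there so the sketch still runs without PyTorch installed. For PolynomialFeatures specifically, fit only records the number of input columns, so fitting on one sample or on the whole matrix should give the same expansion. A transform that learns statistics from the data (for example a scaler) would not behave the same way when fitted per sample.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

try:
    from torch.utils.data import Dataset
except ImportError:  # keeps the sketch runnable without PyTorch installed
    Dataset = object


class PrecomputedDataset(Dataset):
    """Option 1: expand the whole feature matrix once in __init__."""

    def __init__(self, features, labels, degree=2):
        self.Xs = PolynomialFeatures(degree).fit_transform(features)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {'X': self.Xs[idx], 'label': self.labels[idx]}


class PolynomialExp:
    """Option 2: a per-sample transform, applied inside __getitem__."""

    def __init__(self, degree):
        self.poly = PolynomialFeatures(degree)

    def __call__(self, sample):
        # sklearn expects a 2-D array, so treat the sample as a single row
        X = np.asarray(sample['X']).reshape(1, -1)
        X = self.poly.fit_transform(X)[0]
        return {'X': X, 'label': sample['label']}


class LazyDataset(Dataset):
    """Stores raw features and applies the optional transform on access."""

    def __init__(self, features, labels, transform=None):
        self.Xs = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        sample = {'X': self.Xs[idx], 'label': self.labels[idx]}
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


# Random stand-in data: 5 samples with 3 features each
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))
labels = np.arange(5)

eager = PrecomputedDataset(features, labels)
lazy = LazyDataset(features, labels, transform=PolynomialExp(2))

# Check that both options produce the same expanded features per sample
same = all(np.allclose(eager[i]['X'], lazy[i]['X']) for i in range(len(eager)))
print(same)
```

The first option pays the cost once and keeps the expanded matrix in memory. The second recomputes the expansion on every access but keeps the dataset class generic, since the transform can be swapped or omitted.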