Hello,
I’d like to add some pre-processing to my data, which is just a list of features. For example, I would like to do a polynomial expansion; sklearn provides PolynomialFeatures for this.
However, I’m wondering which would be the best way to apply it.
Either in the __init__ of my dataset:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from torch.utils import data

class MyDataset(data.Dataset):
    def __init__(self, csv_file, transform=None):
        self.data = pd.read_csv(csv_file)
        ...
        # expand the whole feature matrix once, when the dataset is loaded
        self.Xs = self.data.iloc[:, 1:-1]
        poly = PolynomialFeatures(2)
        self.Xs = poly.fit_transform(self.Xs)
        self.labels = self.data.iloc[:, -1]
        self.transform = transform
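With this first option, __getitem__ would then just index into the already-expanded array, something like this (a sketch; the __len__ and the way I pass self.transform here are my assumption):

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # features were already polynomially expanded in __init__
        sample = {'X': self.Xs[idx], 'label': self.labels.iloc[idx]}
        if self.transform:
            sample = self.transform(sample)
        return sample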
Or as a transform in my __getitem__, which means I need to define a transform:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

class Polynomial_exp(object):
    def __init__(self, degree):
        self.degree = degree
        self.poly = PolynomialFeatures(self.degree)

    def __call__(self, sample):
        X, Y = sample['X'], sample['label']
        # change to numpy and reshape as a 2-dim matrix (one row)
        X = np.asarray(X).reshape(1, -1)
        X = self.poly.fit_transform(X)
        return {'X': X, 'label': Y}
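And then in __getitem__ I would apply it per sample, roughly like this (assuming the dataset keeps the raw, non-expanded features in self.Xs, i.e. the first snippet without the expansion step):

    def __getitem__(self, idx):
        sample = {'X': self.Xs.iloc[idx].values, 'label': self.labels.iloc[idx]}
        if self.transform:
            # e.g. transform=Polynomial_exp(2) passed to the dataset
            sample = self.transform(sample)
        return sample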
Indeed, the sklearn function can work on the entire dataset at once, whereas in the second option I’m calling it for each sample. However, the same remark applies to any transform we add to our custom dataset. So I’m wondering what the point of the transform approach is and when we should use each solution.
Regards
Barthélémy