Custom Dataset with Min-Max-Scaling


I am a bloody beginner with PyTorch. Currently, I am trying to build a CNN for time series. The goal is to stack m similar time series into a matrix at each time step, always looking back n steps, so that the feature matrix at each time t has shape m x n. Before feeding these feature matrices into a Conv2d network, I want to normalize them, for instance by min-max scaling or last-point scaling. For min-max scaling, each individual time series should be scaled to [0, 1] over its lookback window. To do that, I wrote a custom dataset: I precompute the running mins and maxs with pandas and then apply the scaling per sample. My question is whether that would be the typical way to do this in PyTorch, or whether there is a better way. It is considerably fast, so I am satisfied, but I am asking because I want to learn to do things the right way in PyTorch.
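To make the intended scaling concrete, here is a tiny toy sketch (made-up numbers, plain NumPy) of what one scaled sample should look like: each column is one series, each row one time step of the lookback window, and each column is mapped to [0, 1] independently.

```python
import numpy as np

# toy window: m = 2 series, n = 3 lookback steps (rows are time steps)
window = np.array([[1.0, 10.0],
                   [3.0, 30.0],
                   [2.0, 20.0]])

mins = window.min(axis=0)              # per-series min over the window
maxs = window.max(axis=0)              # per-series max over the window
scaled = (window - mins) / (maxs - mins)

# each column now spans [0, 1] over the window:
# [[0.0, 0.0],
#  [1.0, 1.0],
#  [0.5, 0.5]]
```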

import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class CustomDataset2D(Dataset):
    # lookback window dataset for CNN training
    def __init__(self, features, labels, lookback: int, transform=None, target_transform=None):

        # data and lookback
        self.features  = features.reindex(labels.index)
        self.labels    = labels
        self.lookback  = lookback

        # feature transform: resolve the method by name and run its
        # prepare_* hook (if any) while self.features is still a DataFrame
        self.transform = transform
        if self.transform is not None:
            if hasattr(self, 'prepare_%s' % transform):
                getattr(self, 'prepare_%s' % transform)()
            self.transform = getattr(self, self.transform)

        # label transform
        self.target_transform = target_transform

        # to numpy (after the prepare_* hook, which needs pandas)
        self.features  = self.features.values.astype(np.float64)
        self.labels    = self.labels.values.astype(np.float64)


    def prepare_minmaxscaler(self):
        # must be called while self.features is still a pd.DataFrame;
        # shift(-lookback+1) aligns the rolling stats so that
        # mins[idx]/maxs[idx] cover the window features[idx:idx+lookback]
        self.mins  = self.features.rolling(self.lookback, center=False, min_periods=self.lookback).min().shift(-self.lookback + 1).values.astype(np.float64)
        self.maxs  = self.features.rolling(self.lookback, center=False, min_periods=self.lookback).max().shift(-self.lookback + 1).values.astype(np.float64)
        # note: a series that is constant within a window gives diffs == 0
        # and hence a division by zero in minmaxscaler
        self.diffs = self.maxs - self.mins

    def minmaxscaler(self, x, idx):
        # scale each column of the window to [0, 1] using the precomputed stats
        return (x - self.mins[idx]) / self.diffs[idx]

    def __len__(self):
        return len(self.features) - self.lookback + 1

    def __getitem__(self, idx):
        feature2d = self.features[idx:idx+self.lookback,:]
        label     = self.labels[idx+self.lookback-1]
        if self.transform:
            feature2d = self.transform(feature2d,idx)
        if self.target_transform:
            label = self.target_transform(label)
        return feature2d, label

The inputs features and labels are pd.DataFrames. As you can see, if the parameter transform='minmaxscaler', the transformation defined in the method minmaxscaler is applied when __getitem__ is called. To accelerate this, instead of computing the min and max on the fly, they are precomputed and stored inside the dataset object by the method prepare_minmaxscaler.
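For completeness, here is a small self-contained check (toy random data, made-up sizes) that the shift(-lookback+1) alignment really lines the precomputed rolling stats up with the windows that __getitem__ slices out:

```python
import numpy as np
import pandas as pd

lookback = 4
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 3)))

# rolling min over `lookback` rows, shifted so that row idx holds the
# per-column min of the window df.values[idx:idx+lookback]
mins = df.rolling(lookback, min_periods=lookback).min().shift(-lookback + 1).values

# compare against a direct per-window min for every valid sample index
for idx in range(len(df) - lookback + 1):
    window = df.values[idx:idx + lookback, :]
    assert np.allclose(mins[idx], window.min(axis=0))
```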

Would that be a good way to do it, or is this typically done differently?
I appreciate your response!