Custom Dataset with Min-Max-Scaling


I am a bloody beginner with PyTorch. Currently, I am trying to build a CNN for time series. The goal is to stack m similar time series into a matrix at each time step, always looking back n steps, so that the feature matrix at each time t has shape m x n. Before feeding these feature matrices into a Conv2d network, I want to normalize them, for instance by min-max scaling or last-point scaling. For min-max scaling, each individual time series should be scaled to [0, 1] over its lookback window. To do that, I wrote a custom dataset: I precompute the running mins and maxs with pandas and then apply the scaling per sample. My question is whether that would be the typical way to do this in PyTorch, or whether there is a better way. It is considerably fast, so I am satisfied, but I am asking because I want to learn to do things the right way in PyTorch.
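To make the intended scaling concrete, here is a tiny toy sketch (made-up numbers, plain NumPy) of what one scaled sample should look like: each column is one series, each row one time step of the lookback window, and each column is mapped to [0, 1] independently.

```python
import numpy as np

# toy window: m = 2 series, n = 3 lookback steps (rows are time steps)
window = np.array([[1.0, 10.0],
                   [3.0, 30.0],
                   [2.0, 20.0]])

mins = window.min(axis=0)              # per-series min over the window
maxs = window.max(axis=0)              # per-series max over the window
scaled = (window - mins) / (maxs - mins)

# each column now spans [0, 1] over the window:
# [[0.0, 0.0],
#  [1.0, 1.0],
#  [0.5, 0.5]]
```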

import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class CustomDataset2D(Dataset):
    # lookback window dataset for CNN training
    def __init__(self, features, labels, lookback: int, transform=None, target_transform=None):

        # data and lookback
        self.features  = features.reindex(labels.index)
        self.labels    = labels
        self.lookback  = lookback

        # feature transform: resolve the method by name and run its
        # prepare_* hook (if any) while self.features is still a DataFrame
        self.transform = transform
        if self.transform is not None:
            if hasattr(self, 'prepare_%s' % transform):
                getattr(self, 'prepare_%s' % transform)()
            self.transform = getattr(self, self.transform)

        # label transform
        self.target_transform = target_transform

        # to numpy (after the prepare_* hook, which needs pandas)
        self.features  = self.features.values.astype(np.float64)
        self.labels    = self.labels.values.astype(np.float64)


    def prepare_minmaxscaler(self):
        # must be called while self.features is still a pd.DataFrame;
        # shift(-lookback+1) aligns the rolling stats so that
        # mins[idx]/maxs[idx] cover the window features[idx:idx+lookback]
        self.mins  = self.features.rolling(self.lookback, center=False, min_periods=self.lookback).min().shift(-self.lookback + 1).values.astype(np.float64)
        self.maxs  = self.features.rolling(self.lookback, center=False, min_periods=self.lookback).max().shift(-self.lookback + 1).values.astype(np.float64)
        # note: a series that is constant within a window gives diffs == 0
        # and hence a division by zero in minmaxscaler
        self.diffs = self.maxs - self.mins

    def minmaxscaler(self, x, idx):
        # scale each column of the window to [0, 1] using the precomputed stats
        return (x - self.mins[idx]) / self.diffs[idx]

    def __len__(self):
        return len(self.features) - self.lookback + 1

    def __getitem__(self, idx):
        feature2d = self.features[idx:idx+self.lookback,:]
        label     = self.labels[idx+self.lookback-1]
        if self.transform:
            feature2d = self.transform(feature2d,idx)
        if self.target_transform:
            label = self.target_transform(label)
        return feature2d, label

The inputs features and labels are pd.DataFrames. As you can see, if the parameter transform='minmaxscaler', the transformation defined in the method minmaxscaler is applied when __getitem__ is called. To accelerate this, instead of computing the min and max on the fly, they are precomputed and stored inside the dataset object by the method prepare_minmaxscaler.
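For completeness, here is a small self-contained check (toy random data, made-up sizes) that the shift(-lookback+1) alignment really lines the precomputed rolling stats up with the windows that __getitem__ slices out:

```python
import numpy as np
import pandas as pd

lookback = 4
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 3)))

# rolling min over `lookback` rows, shifted so that row idx holds the
# per-column min of the window df.values[idx:idx+lookback]
mins = df.rolling(lookback, min_periods=lookback).min().shift(-lookback + 1).values

# compare against a direct per-window min for every valid sample index
for idx in range(len(df) - lookback + 1):
    window = df.values[idx:idx + lookback, :]
    assert np.allclose(mins[idx], window.min(axis=0))
```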

Would that be a good way to do it, or is this typically done differently?
I appreciate your response!