Homogeneous classes with scikit-learn split and custom PyTorch Loader

Hi All,

I have a manually recorded dataset in which each instance is saved in its own .npy file. Because of that, I split the list of file paths, pass the resulting paths to my custom data loader, and extract the label from the file name there, so that it returns each instance together with its label.

import os
from sklearn.model_selection import train_test_split
import numpy as np
import torch
X_train, X_test = train_test_split(full_paths,
                                   test_size=0.2,
                                   random_state=42,
                                   shuffle=True,
                                   )

X_train, X_val = train_test_split(X_train,
                                  test_size=0.2,
                                  random_state=42,
                                  shuffle=True,
                                 )

Then I define my data loader parameters:

# Training Parameters
params_train = {'batch_size': 25,
                'shuffle': True,
                'drop_last': True,
                }

# Testing Parameters
params_test = {'batch_size': 25,
               'shuffle': False,
               'drop_last': True,
               }

Then my data loader (a custom Dataset class) looks like this:

class Dataset(torch.utils.data.Dataset):
    """Characterizes a dataset for PyTorch"""

    def __init__(self, file_paths):
        """Initialization"""
        self.file_paths = file_paths
        self.labels = {
            "Class 0": 0,
            "Class 1": 1,
            "Class 2": 2,
            "Class 3": 3,
            "Class 4": 4,
            "Class 5": 5,
            "Class 6": 6,
            "Class 7": 7,
        }

    def __len__(self):
        """Denotes the total number of samples"""
        return len(self.file_paths)

    def __getitem__(self, index):
        """Generates one sample of data"""
        # Load data and get label
        file_path = self.file_paths[index]
        x = np.load(file_path)[:100]
        label = os.path.basename(file_path).split('_')[0]
        y = self.labels[label]

        return x, y

Then I prepare the generators to be used inside the training, validation, and testing loops:

training_set = Dataset(X_train)
training_generator = torch.utils.data.DataLoader(training_set, **params_train)

test_set = Dataset(X_test)
test_generator = torch.utils.data.DataLoader(test_set, **params_test)

val_set = Dataset(X_val)
val_generator = torch.utils.data.DataLoader(val_set, **params_train)

However, the problem is that the resulting split is not balanced across classes. To fix that I would need to use the stratify parameter of train_test_split, which requires the labels to be defined before the split, whereas I only extract the labels later, inside __getitem__ of the data loader.
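
For reference, the workaround I have in mind would look roughly like the sketch below: derive the labels from the file names first, then pass them to `stratify`. The helper names `label_from_path` and `stratified_split` and the example paths are hypothetical. The only real assumption is that each file name starts with the class name followed by an underscore, as in my `__getitem__`.

```python
import os
from collections import Counter

from sklearn.model_selection import train_test_split


def label_from_path(file_path):
    """Return the class name, i.e. the file-name part before the first underscore."""
    return os.path.basename(file_path).split('_')[0]


def stratified_split(paths, test_size=0.2, random_state=42):
    """Split the paths while keeping class proportions equal in both parts."""
    # Labels are only needed here to guide the split; the Dataset can
    # still extract them again from the paths on its own.
    labels = [label_from_path(p) for p in paths]
    return train_test_split(paths,
                            test_size=test_size,
                            random_state=random_state,
                            shuffle=True,
                            stratify=labels)


# Hypothetical paths: 8 classes with 10 recordings each.
example_paths = [f"data/Class {c}_{i}.npy" for c in range(8) for i in range(10)]

train_paths, test_paths = stratified_split(example_paths)
test_counts = Counter(label_from_path(p) for p in test_paths)
```

This keeps the split on the paths, and the labels built here are thrown away right after the split, but it still means parsing the file names twice.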

My questions are:

  1. Does it make sense to stratify a balanced dataset? My dataset is balanced in the sense that I recorded the same number of recordings for each class.

  2. Does it make sense to shuffle twice, once in the scikit-learn split and again inside the data loader?

  3. Is there a smarter way of implementing this than defining my labels before the split? I mean a way to keep splitting on the paths as I am doing now, and keep extracting the labels inside the data loader, while still getting an equal proportion of each class in each of the splits.
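
To check whether a split really has an equal proportion of each class, I count the labels per split with something like the sketch below. The helper name `class_counts` and the example paths are made up for illustration. The label parsing is the same as in my `__getitem__`.

```python
import os
from collections import Counter


def class_counts(paths):
    """Count how many files of each class appear in a list of paths."""
    # The class name is the file-name part before the first underscore.
    return Counter(os.path.basename(p).split('_')[0] for p in paths)


# Hypothetical paths standing in for one of the splits (e.g. X_val).
example_split = [
    "data/Class 0_001.npy",
    "data/Class 0_002.npy",
    "data/Class 1_001.npy",
]

counts = class_counts(example_split)
for name, count in sorted(counts.items()):
    print(f"{name}: {count}")
```

Running this on X_train, X_val, and X_test is how I see that the classes are not evenly represented across the splits.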

Thanks.

@ptrblck Please have a look :smile: