High-level question: how did you implement dataset splitting and standardisation of training data?

Hi ptrblck,

Thank you for your in-depth responses on the forums! I’m currently working on a sequential image classifier using a CNN-Bi-LSTM architecture. Fixed-length sequences of images (length ~5) are fed into my network, which outputs a single label for classification.

My current approach to splitting the data has been to create a Dataset that reads in the image paths and class IDs, and maps the index passed to __getitem__() to a sequence in a sliding-window fashion. This works perfectly fine, but is it more standard to implement this sequence-partitioning logic in a Sampler (similar to your reply here)?

I’m currently using random_split for my train, validation, and test splits, and have since realized there is data leakage due to the sliding-window approach (adjacent windows share frames, so a test window can overlap a train window). How do you recommend fixing this while keeping the model robust (without breaking the sequence-partitioning logic)? If I don’t allocate my testing data in a large chunk (i.e., if I use random sampling), I’ll lose a good number of potential sequences.
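To make the chunk option concrete, here is a rough sketch of what I mean (a hypothetical helper I wrote for illustration, not code from my project): carve each class's frames into contiguous train/val/test chunks *before* windowing, and drop the few windows that would straddle a chunk boundary, so no frame is shared across splits.

```python
def split_window_starts(n_frames, seq_length, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) lists of window start indices for one class.

    Frames [0, train_end) go to train, [train_end, val_end) to val, and the
    rest to test. A window starting at s covers frames s .. s + seq_length - 1,
    so the last valid start inside a chunk ending at e (exclusive) is
    e - seq_length; windows that would cross a boundary are simply dropped.
    """
    train_end = int(n_frames * train_frac)
    val_end = int(n_frames * (train_frac + val_frac))
    train = list(range(0, max(0, train_end - seq_length + 1)))
    val = list(range(train_end, max(train_end, val_end - seq_length + 1)))
    test = list(range(val_end, max(val_end, n_frames - seq_length + 1)))
    return train, val, test
```

The cost is only seq_length - 1 windows per boundary per class, which seems much cheaper than reserving a huge contiguous test block.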

Here is my Dataset class:

import os

import pandas as pd
import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class MyDataset(Dataset):
    def __init__(self, img_dir, annotations_file, seq_length, transform=None, target_transform=None):
        self.img_dir = img_dir
        self.img_labels = pd.read_csv(annotations_file, header=None, names=['image', 'class'])
        self.seq_length = seq_length
        self.transform = transform
        self.target_transform = target_transform
        self.class_groups = self.img_labels.groupby('class')

    def __len__(self):
        # One sample per sliding-window position; max(0, ...) guards against
        # classes with fewer frames than seq_length
        return sum(max(0, len(group) - self.seq_length + 1) for _, group in self.class_groups)

    def __getitem__(self, idx):
        # Map the flat index to a (class, window start) pair
        for class_label, group in self.class_groups:
            group_size = max(0, len(group) - self.seq_length + 1)
            if idx < group_size:
                group_idx = idx
                break
            idx -= group_size
        else:
            raise IndexError(idx)

        # Read the seq_length images that make up this window
        img_paths = group.iloc[group_idx : group_idx + self.seq_length, 0].tolist()
        images = []
        for img_path in img_paths:
            image = read_image(os.path.join(self.img_dir, img_path)).float()
            if self.transform:
                image = self.transform(image)
            images.append(image)

        images = torch.stack(images)
        label = torch.tensor(class_label)

        if self.target_transform:
            label = self.target_transform(label)

        return images, label
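If the windowing stays in __getitem__ like above, I imagine the split could be done by precomputing which window-start indices belong to each split and feeding them to a SubsetRandomSampler, rather than using random_split. Here is a toy sketch of what I mean (ToySeqDataset and the index lists are made up for illustration; they stand in for MyDataset and a leakage-free split):

```python
import torch
from torch.utils.data import DataLoader, Dataset, SubsetRandomSampler

class ToySeqDataset(Dataset):
    """Toy stand-in for MyDataset: index = window start, item = the window."""
    def __init__(self, data, seq_length):
        self.data = data
        self.seq_length = seq_length

    def __len__(self):
        return len(self.data) - self.seq_length + 1

    def __getitem__(self, idx):
        return self.data[idx : idx + self.seq_length]

data = torch.arange(20).float()
ds = ToySeqDataset(data, seq_length=3)

# Hypothetical precomputed start indices: train windows only cover frames
# 0..11, test windows only cover frames 12..19, so no frame is shared
train_idx, test_idx = list(range(0, 10)), list(range(12, 18))

train_loader = DataLoader(ds, batch_size=4, sampler=SubsetRandomSampler(train_idx))
test_loader = DataLoader(ds, batch_size=4, sampler=SubsetRandomSampler(test_idx))
```

This keeps the Dataset untouched and puts all of the split logic in the sampler's index lists, which seems close to the Sampler approach from your linked reply.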

On another note: will the LSTM still learn effectively if each batch mixes sequences from different classes?

Really appreciate your help!