Custom Dataset with some preprocessing

Hello, I have a question. I have about 2 million images (the Places365-Standard dataset) and I want to do some data augmentation such as transforming, cropping, etc. I also have to build my own target image (y) based on some color-model algorithms (CMYK, for example).

So my preprocessing step actually includes augmentation and making the target image (y), and then I need to feed these images to a deep network. Where should this happen in the Dataset class? Should I do my processing step in __getitem__()? If yes, would it be parallel and fast?

Here is my template:

import torch
from torch.utils import data

class Dataset(data.Dataset):
    """
    Dataset class representing our data set
    """
    def __init__(self, list_IDs, labels):
        """
        Initialize the data set as a list of IDs corresponding to each item of the data set, plus the label of each item

        Args:
            list_IDs: a list of IDs for each data point in data set
            labels: label of an item in data set with respect to the ID
        """

        self.labels = labels
        self.list_IDs = list_IDs

    def __len__(self):
        """
        Return the length of the data set using the list of IDs

        :return: number of samples in data set
        """
        return len(self.list_IDs)

    def __getitem__(self, item):
        """
        Generate one item of the data set. Here we apply our preprocessing, e.g. halftone styles and the subtractive color process using the CMYK color model (see the paper for the operations)

        :param item: index of item in IDs list

        :return: a sample of data
        """
        ID = self.list_IDs[item]

        # Code to load the data for this ID
        X = None

        # Code to apply your custom function to make the y image
        # (time-consuming task - some algorithms)
        y = None

        return X, y


Thanks for any advice

Best regards

Where should this happen in the Dataset class? Should I do my processing step in __getitem__()? If yes, would it be parallel and fast?

Yes, the __getitem__ calls are the ones that run in parallel when you use a DataLoader with num_workers > 0. Here's an example:

import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CelebaDataset(Dataset):
    """Custom Dataset for loading CelebA face images"""

    def __init__(self, txt_path, img_dir, transform=None):
        df = pd.read_csv(txt_path, sep=" ", index_col=0)
        self.img_dir = img_dir
        self.txt_path = txt_path
        self.img_names = df.index.values
        self.y = df['Male'].values
        self.transform = transform

    def __getitem__(self, index):
        # Loading and transforming happen here, so the DataLoader
        # workers execute this in parallel for each sample
        img = Image.open(os.path.join(self.img_dir,
                                      self.img_names[index]))

        if self.transform is not None:
            img = self.transform(img)

        label = self.y[index]
        return img, label

    def __len__(self):
        return self.y.shape[0]

For the transform argument you can use the torchvision.transforms utilities to "compose" a transform pipeline. E.g.,

from torchvision import transforms
from torch.utils.data import DataLoader

custom_transform = transforms.Compose([transforms.Grayscale(),
                                       # transforms.Lambda(lambda x: x/255.),  # scaling is already done by ToTensor()
                                       transforms.ToTensor()])

train_dataset = CelebaDataset(txt_path='celeba_gender_attr_train.txt',
                              img_dir='img_align_celeba/',
                              transform=custom_transform)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=128,
                          shuffle=True,
                          num_workers=4)
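To tie this back to your case: the expensive target-generation step would also go inside __getitem__, so the DataLoader workers compute it in parallel. Below is a rough sketch of that idea; the rgb_to_cmyk helper and the Places365CMYKDataset name are just placeholders I made up for whatever your actual algorithm and file layout look like:

import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


def rgb_to_cmyk(img):
    """Naive RGB -> CMYK conversion; a stand-in for your actual target-generation algorithm."""
    rgb = np.asarray(img, dtype=np.float32) / 255.0    # (H, W, 3) in [0, 1]
    k = 1.0 - rgb.max(axis=2)                          # key (black) channel
    denom = np.clip(1.0 - k, 1e-8, None)               # avoid division by zero on black pixels
    c = (1.0 - rgb[..., 0] - k) / denom
    m = (1.0 - rgb[..., 1] - k) / denom
    y = (1.0 - rgb[..., 2] - k) / denom
    return np.stack([c, m, y, k], axis=0)              # (4, H, W)


class Places365CMYKDataset(Dataset):
    """Loads an RGB image as input X and derives the target y from it inside __getitem__."""

    def __init__(self, img_dir, img_names, transform=None):
        self.img_dir = img_dir
        self.img_names = img_names
        self.transform = transform

    def __len__(self):
        return len(self.img_names)

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_dir, self.img_names[index])).convert('RGB')

        # The time-consuming target generation happens here, so the
        # DataLoader workers compute it in parallel for each sample
        y = torch.from_numpy(rgb_to_cmyk(img))

        if self.transform is not None:
            X = self.transform(img)
        else:
            X = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0).permute(2, 0, 1)

        return X, y

With num_workers=4 as above, four worker processes run __getitem__ concurrently, so the per-sample cost of the target generation is amortized as long as the workers keep up with the GPU.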

Thank you for answering.
So if I want to do anything else to my images, I should put that code in the __getitem__() method.

How would I change this line (self.y = df['Male'].values) for several classes?

This is a column that was in the CSV file. You could create a CSV column that contains all the classes of interest, and that should do the trick.
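For example, if the label file had a multi-class column named class_label (the file and column names here are just made up), reading it would look like this:

import pandas as pd

# Hypothetical label file with one multi-class column named 'class_label'
df = pd.read_csv('multiclass_attr_train.txt', sep=" ", index_col=0)

img_names = df.index.values
labels = df['class_label'].values   # e.g. 0, 1, 2, 3, ... instead of just 0 and 1

In the CelebaDataset above you would then simply read df['class_label'] instead of df['Male'] in __init__.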

@rasbt So I would have a column and label the attributes with 1, 2, 3, 5, … instead of 1 and 0?

Yes, you are correct
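One practical note, just as a sketch: if those labels go into nn.CrossEntropyLoss, they need to be contiguous integers starting at 0, so values like 1, 2, 3, 5 are usually remapped first, e.g.:

import torch

# Hypothetical raw labels as they appear in the CSV column
raw_labels = [1, 2, 3, 5, 2, 1]

# Map each distinct label to a contiguous index 0..C-1 (what nn.CrossEntropyLoss expects)
classes = sorted(set(raw_labels))                      # [1, 2, 3, 5]
class_to_idx = {c: i for i, c in enumerate(classes)}   # {1: 0, 2: 1, 3: 2, 5: 3}

labels = torch.tensor([class_to_idx[c] for c in raw_labels])   # tensor([0, 1, 2, 3, 1, 0])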