How to distinguish between the validation and train set when using random_split for a custom dataset?

I defined a simple custom dataset like in the official example and named the class `Dataset`. Then, in my training method, I split the dataset into training and validation sets like this:

    import torch
    from torch.utils.data import DataLoader

    dataset = Dataset(opt)

    # 80/20 train/validation split
    train_size = int(0.8 * len(dataset))
    val_size = len(dataset) - train_size
    lengths = [train_size, val_size]

    train_dataset, val_dataset = torch.utils.data.random_split(dataset, lengths)

    trainloader = DataLoader(
        train_dataset,
        batch_size=opt.n_batches,
        shuffle=True,
        num_workers=opt.n_workers,
        pin_memory=True,
    )
    valloader = DataLoader(
        val_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=opt.n_workers,
        pin_memory=True,
    )

In the `__getitem__(self, idx)` method of the dataset I would like the validation set to behave differently from the training set. How can I tell which split a sample belongs to? Should I initialize the two datasets independently?

I am basically looking for a variable like `self.training` in the `nn.Module` class (see here).
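
Something like this, where `self.training` on a dataset and the helpers are purely hypothetical; this is just to illustrate what I am after:

    class Dataset(torch.utils.data.Dataset):
        def __getitem__(self, idx):
            sample = self.samples[idx]         # however the data is stored
            if self.training:                  # the flag I am looking for
                sample = self.augment(sample)  # hypothetical train-only augmentation
            return sample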

Instead of using `random_split` I would create indices for both splits, create two separate datasets with the custom transformations etc. applied to the training and validation dataset respectively, and wrap both datasets in `Subset`s using those indices.
This would allow you to pass different arguments or transformations without manipulating internal flags.
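
A minimal sketch of that approach; the `mode` argument is hypothetical, i.e. you would add it to your `Dataset` class yourself and branch on it inside `__getitem__`:

    import torch
    from torch.utils.data import Subset, DataLoader

    # One random permutation of indices, split 80/20, so the two
    # Subsets cover disjoint parts of the same underlying data.
    num_samples = len(Dataset(opt))
    indices = torch.randperm(num_samples).tolist()
    train_size = int(0.8 * num_samples)
    train_indices = indices[:train_size]
    val_indices = indices[train_size:]

    # Two independent dataset instances, each configured for its split.
    train_dataset = Subset(Dataset(opt, mode='train'), train_indices)
    val_dataset = Subset(Dataset(opt, mode='val'), val_indices)

    trainloader = DataLoader(train_dataset, batch_size=opt.n_batches, shuffle=True)
    valloader = DataLoader(val_dataset, batch_size=1)

Since each split owns its own `Dataset` instance, `__getitem__` can check the configured mode directly and nothing has to be toggled at runtime.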
