Issues with torch.utils.data.random_split

I’m not sure whether the issue is solved, so let me know if you are still stuck.
Since df_labels contains the targets, you should be able to pass them to train_test_split to create the split indices and then create the datasets via Subset.
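
A minimal sketch of that suggestion, assuming scikit-learn is installed. The toy `labels` array and `TensorDataset` are hypothetical stand-ins for the targets in df_labels and the real dataset; only the index-splitting pattern is the point here.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Toy stand-in for the real data: 100 samples with imbalanced binary labels.
labels = np.array([0] * 80 + [1] * 20)
features = torch.randn(len(labels), 4)
dataset = TensorDataset(features, torch.from_numpy(labels))

# Split the *indices* rather than the data, stratifying on the labels
# so that both splits keep the original class ratio.
indices = np.arange(len(dataset))
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=0
)

# Wrap the original dataset; Subset only stores the index lists.
train_set = Subset(dataset, train_idx.tolist())
val_set = Subset(dataset, val_idx.tolist())

print(len(train_set), len(val_set))
```

Passing `stratify=labels` is what keeps the class proportions the same in both splits; without it the split is purely random.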

I tried to use Subset like this to split the data into training and validation sets:

training_set = torch.utils.data.Subset(dataset, range(0, len(dataset), 2))
validation_set = torch.utils.data.Subset(dataset, range(1, len(dataset), 2))

My bad, I have no idea how to use train_test_split on a PyTorch dataset.
I only managed to split the dataset with random_split and Subset.

It is working

but I'm worried that the validation set might be unbalanced, which would result in bad predictions.
Also, which would you recommend for splitting the dataset: Subset or random_split?
Thank you for helping.
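
One way to check that worry is to count the labels in each split. The sketch below assumes each sample is a `(data, label)` pair; the toy dataset with alternating labels is hypothetical and deliberately chosen as the worst case for an even/odd index split.

```python
from collections import Counter

import torch
from torch.utils.data import Subset, TensorDataset

# Toy stand-in dataset whose labels alternate 0, 1, 0, 1, ...
labels = torch.tensor([0, 1] * 10)
dataset = TensorDataset(torch.randn(20, 3), labels)

# Same even/odd split as in the post above.
training_set = Subset(dataset, range(0, len(dataset), 2))
validation_set = Subset(dataset, range(1, len(dataset), 2))


def class_counts(subset):
    """Count how often each label occurs by iterating over the subset."""
    return Counter(int(label) for _, label in subset)


train_counts = class_counts(training_set)
val_counts = class_counts(validation_set)
print("train:", train_counts)
print("val:  ", val_counts)
```

If the sample order happens to correlate with the class, as it does here, a fixed-stride split can put each class entirely into one split, so comparing the counts before training is a cheap sanity check.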

How do you use different transforms for the results of random_split? For example, I have

from torch.utils.data import DataLoader, random_split
from torch import Generator
from torchvision.transforms import ToTensor
from torchvision.datasets import ImageFolder


TEST_RATIO = 0.2
BATCH_SIZE = 32

# Load the full dataset from disk (data_dir and SEED are assumed to be defined elsewhere)
dataset_all = ImageFolder(
    data_dir,
    transform=ToTensor(),
)

size_all = len(dataset_all)
print(f'Before splitting the full dataset into train and test: len(dataset_all)={size_all}')


size_test = int(size_all * TEST_RATIO)
size_train = size_all - size_test

dataset_train, dataset_test = random_split(dataset_all, [size_train, size_test], generator=Generator().manual_seed(SEED))
print(f'After splitting the full dataset into train and test: len(dataset_train)={len(dataset_train)}. len(dataset_test)={len(dataset_test)}')

What if I want to use ColorJitter for train but not for test?

For your use case I would probably create Subsets and pass the indices explicitly, as seen in this example, since that would allow you to keep the specified transformations.
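
A rough sketch of that idea, not necessarily identical to the linked example: create the dataset twice with different transforms and index both with disjoint index lists. `ToyDataset` and the lambda transform are hypothetical stand-ins so the snippet runs without image files; with real data you would presumably create two ImageFolder instances on the same data_dir, one with ColorJitter in its transform and one without.

```python
import torch
from torch.utils.data import Dataset, Subset


class ToyDataset(Dataset):
    """Hypothetical stand-in for ImageFolder: returns (sample, label)."""

    def __init__(self, transform=None):
        self.data = torch.arange(10, dtype=torch.float32)
        self.targets = torch.arange(10) % 2
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.targets[idx]


TEST_RATIO = 0.2

# Two views of the same underlying data with different transforms.
# The "+ 100" lambda is a placeholder for a train-only augmentation.
base_train = ToyDataset(transform=lambda x: x + 100.0)
base_test = ToyDataset(transform=None)

# Draw one seeded permutation and cut it into disjoint index lists.
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(len(base_train), generator=generator).tolist()
size_test = int(len(perm) * TEST_RATIO)
test_idx, train_idx = perm[:size_test], perm[size_test:]

# Each Subset indexes the dataset that carries the matching transform.
dataset_train = Subset(base_train, train_idx)
dataset_test = Subset(base_test, test_idx)
```

Because both Subsets draw from the same permutation, no sample ends up in both splits, and only the training view applies the augmentation.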