What is the difference between creating a validation set using random_split as opposed to SubsetRandomSampler?

Hi all!

I was wondering what the difference between using random_split and SubsetRandomSampler is.

Do we use SubsetRandomSampler together with Subset?

What are the pros and cons of using SubsetRandomSampler and random_split for creating a train-val split (or any split for that matter) for image datasets?


random_split returns two Datasets with non-overlapping indices, which were drawn randomly based on the passed lengths, while SubsetRandomSampler accepts the indices directly.
Both can be used for different use cases.

E.g. if you want to run a training epoch with certain samples only (e.g. from a subset of classes or “hard to learn” samples), you could calculate these indices somehow and use a SubsetRandomSampler for this training epoch.
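A minimal sketch of that idea, assuming a toy TensorDataset in place of the real image dataset. The data and the names `wanted_classes` and `subset_indices` are made up for illustration; only samples from the chosen classes are drawn during the epoch.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy stand-in for an image dataset: 12 samples, 3 classes (hypothetical data).
features = torch.randn(12, 4)
labels = torch.tensor([0, 1, 2] * 4)
dataset = TensorDataset(features, labels)

# Keep only the indices whose label belongs to the classes of interest.
wanted_classes = {0, 2}
subset_indices = [i for i, y in enumerate(labels.tolist()) if y in wanted_classes]

# The sampler visits exactly these indices, in a new random order every epoch.
sampler = SubsetRandomSampler(subset_indices)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

# Collect the labels seen in one epoch to confirm class 1 never shows up.
seen_labels = torch.cat([batch_labels for _, batch_labels in loader])
print(sorted(set(seen_labels.tolist())))
```

The same pattern should work for "hard to learn" samples: compute the indices however you like and pass them to the sampler.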

random_split on the other hand splits the dataset randomly, so that you won’t have that much control over it.
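For comparison, a small sketch of random_split on a toy dataset (the data is hypothetical). You only choose the lengths; the indices are drawn for you. Passing a seeded generator is one way to make the split reproducible.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset with 10 samples (hypothetical data).
dataset = TensorDataset(torch.arange(10).unsqueeze(1), torch.arange(10) % 2)

# Only the lengths are specified; which samples land where is random.
generator = torch.Generator().manual_seed(0)
train_set, val_set = random_split(dataset, [8, 2], generator=generator)

# Each returned object is a Subset that remembers its indices into `dataset`.
train_indices = set(train_set.indices)
val_indices = set(val_set.indices)
print(len(train_set), len(val_set), train_indices.isdisjoint(val_indices))
```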


Ah. So, with random_split, I wouldn’t be able to do, for instance, a stratified split based on output labels in case of imbalanced data? But I would be able to do the same with SubsetRandomSampler / WeightedRandomSampler?

Yes, you could create balanced batches using WeightedRandomSampler or over/undersample specific classes with SubsetRandomSampler, while this wouldn’t be possible by just using random_split.
However, you could of course use a random_split in combination with the samplers. :wink:
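A rough sketch of that combination, assuming an imbalanced toy dataset (90 vs. 10 samples, made up for illustration): random_split creates the subsets, and a WeightedRandomSampler built from the training subset's own labels rebalances the training batches.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler, random_split

# Imbalanced toy data: 90 samples of class 0, 10 of class 1 (hypothetical).
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(torch.randn(100, 4), labels)

train_set, val_set = random_split(
    dataset, [80, 20], generator=torch.Generator().manual_seed(0)
)

# Labels of the training subset, in the subset's own order.
train_labels = labels[train_set.indices]

# One weight per training sample: the inverse frequency of its class.
class_counts = torch.bincount(train_labels, minlength=2)
sample_weights = 1.0 / class_counts[train_labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights, num_samples=len(sample_weights), replacement=True
)
# The sampler yields positions inside train_set, so it is paired with train_set.
train_loader = DataLoader(train_set, batch_size=16, sampler=sampler)
val_loader = DataLoader(val_set, batch_size=1, shuffle=False)
```

Note that the weights have to be computed from the subset's labels, not from the full dataset, because the sampler indexes into the subset.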


After going through a lot of blogs and answers on the discuss page, I think I finally got the hang of it.

Let me know if I got the following implementations of train + val set creation using random_split and SubsetRandomSampler right.

Finally, I’d like to discuss the challenge of doing this with WeightedRandomSampler.

The data is read using ImageFolder. Task is binary image classification with 498 images in the dataset which are equally distributed among both classes (249 images each).

img_dataset = ImageFolder(..., transform=t)

1. SubsetRandomSampler

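import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

dataset_size = len(img_dataset)
dataset_indices = list(range(dataset_size))

# Shuffle once so the train/val assignment is random.
np.random.shuffle(dataset_indices)

# Hold out the first 20% of the shuffled indices for validation.
val_split_index = int(np.floor(0.2 * dataset_size))
train_idx, val_idx = dataset_indices[val_split_index:], dataset_indices[:val_split_index]

train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)

# shuffle must stay False when a sampler is passed; the sampler does the shuffling.
train_loader = DataLoader(dataset=img_dataset, shuffle=False, batch_size=8, sampler=train_sampler)
validation_loader = DataLoader(dataset=img_dataset, shuffle=False, batch_size=1, sampler=val_sampler)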
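On the earlier question of whether SubsetRandomSampler is used together with Subset: as far as I can tell they are alternatives. A sketch of the same 80/20 split with Subset, assuming a toy TensorDataset in place of img_dataset (hypothetical data); the index lists wrap the dataset directly, so the loaders can use shuffle=True/False as usual.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for img_dataset: 498 samples, 2 balanced classes (hypothetical).
dataset = TensorDataset(torch.randn(498, 4), torch.tensor([0, 1] * 249))

# Shuffle the indices once, then hold out 20% for validation.
indices = torch.randperm(len(dataset), generator=torch.Generator().manual_seed(0)).tolist()
val_size = int(0.2 * len(dataset))
val_idx, train_idx = indices[:val_size], indices[val_size:]

# Subset wraps the dataset, so no sampler is needed in the loaders.
train_set = Subset(dataset, train_idx)
val_set = Subset(dataset, val_idx)

train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=1, shuffle=False)
```

One practical difference: with Subset the validation loader iterates in a fixed order, while SubsetRandomSampler reshuffles the validation indices on every pass.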

2. random_split

Here, out of the 498 total images, 400 get randomly assigned to train and the remaining 98 to validation.

dataset_train, dataset_valid = random_split(img_dataset, (400, 98))

train_loader = DataLoader(dataset=dataset_train, shuffle=True, batch_size=8)
val_loader = DataLoader(dataset=dataset_valid, shuffle=False, batch_size=1)

3. WeightedRandomSampler

If you stumbled here searching for WeightedRandomSampler, check @ptrblck’s answer here for a good explanation of what is happening below.

Now, how does WeightedRandomSampler fit into creating a train + val set? Unlike SubsetRandomSampler or random_split(), it does not split the data into train and val. It only makes each batch contain a roughly equal number of samples from each class during training.

So, my guess is that we need to use WeightedRandomSampler after random_split() or SubsetRandomSampler. But this wouldn’t ensure that train and val have a similar ratio between classes.

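# ImageFolder stores every sample's class index in .targets, in dataset order,
# so there is no need to load each image just to read its label.
target_list = torch.tensor(img_dataset.targets)

# Do not shuffle target_list: the per-sample weights built from it below
# must stay aligned with the order of the samples in img_dataset.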

# get_class_distribution() is a function that takes in a dataset and
# returns a dictionary with the count per class. In this case,
# get_class_distribution(img_dataset) returns the following:
# {'class_0': 249, 'class_1': 249}
class_count = list(get_class_distribution(img_dataset).values())
class_weights = 1./torch.tensor(class_count, dtype=torch.float) 

class_weights_all = class_weights[target_list]

weighted_sampler = WeightedRandomSampler(
    weights=class_weights_all,
    num_samples=len(class_weights_all),
    replacement=True
)

So, how do we ensure that our train and val sets as well as batches in train_loader and val_loader have an equal distribution of output classes?


I would use a stratified split using e.g. sklearn.model_selection.train_test_split, calculate the weights separately for each split, and use a WeightedRandomSampler for each subset to get balanced batches.
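A sketch of how that could look, assuming a toy TensorDataset with the same 249/249 class counts in place of the ImageFolder dataset. The helper `make_balanced_sampler` is a hypothetical name, not a PyTorch function; with ImageFolder the labels could come from `img_dataset.targets` instead.

```python
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset, TensorDataset, WeightedRandomSampler

# Toy stand-in for the image dataset: 249 samples per class (hypothetical data).
labels = torch.tensor([0] * 249 + [1] * 249)
dataset = TensorDataset(torch.randn(498, 4), labels)

# Stratified split over the indices, so both splits keep the class ratio.
indices = list(range(len(dataset)))
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, stratify=labels.tolist(), random_state=0
)

def make_balanced_sampler(split_labels):
    """Hypothetical helper: weight each sample by the inverse frequency of its class."""
    class_counts = torch.bincount(split_labels)
    sample_weights = 1.0 / class_counts[split_labels].float()
    return WeightedRandomSampler(
        weights=sample_weights, num_samples=len(sample_weights), replacement=True
    )

# Weights are computed separately from each split's own labels.
train_labels = labels[train_idx]
val_labels = labels[val_idx]

train_loader = DataLoader(
    Subset(dataset, train_idx), batch_size=8, sampler=make_balanced_sampler(train_labels)
)
val_loader = DataLoader(
    Subset(dataset, val_idx), batch_size=1, sampler=make_balanced_sampler(val_labels)
)
```

Because the sampler draws with replacement, some samples can appear more than once per epoch and others not at all. If every validation sample should be seen exactly once, a plain loader with shuffle=False on the stratified validation subset may be the simpler choice.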


Thank you so much for your patience in answering! :slight_smile:

Ah. Is there a native PyTorch function that does that or do I have to use sklearn only?

Also, are the above implementations correct?

I’m not aware of a stratified split in PyTorch and would just use the sklearn implementation.
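If avoiding the extra dependency matters, a stratified split can be hand-rolled with plain tensor operations. This is only a sketch: `stratified_split_indices` is a hypothetical helper, not a PyTorch API, and it simply splits each class's indices separately.

```python
import torch

def stratified_split_indices(labels, val_fraction, generator=None):
    """Hypothetical helper: split indices class by class to preserve the class ratio."""
    train_idx, val_idx = [], []
    for cls in torch.unique(labels):
        # All positions belonging to this class, in random order.
        cls_idx = torch.nonzero(labels == cls).flatten()
        cls_idx = cls_idx[torch.randperm(len(cls_idx), generator=generator)]
        # Send the requested fraction of this class to validation.
        n_val = int(round(val_fraction * len(cls_idx)))
        val_idx.extend(cls_idx[:n_val].tolist())
        train_idx.extend(cls_idx[n_val:].tolist())
    return train_idx, val_idx

# Same class counts as the dataset in this thread (labels are made up).
labels = torch.tensor([0] * 249 + [1] * 249)
train_idx, val_idx = stratified_split_indices(
    labels, val_fraction=0.2, generator=torch.Generator().manual_seed(0)
)
print(len(train_idx), len(val_idx))
```

The resulting index lists can then be passed to Subset or SubsetRandomSampler in the usual way.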

The implementations look correct, but I haven’t tested them. :wink:


Could you please show us what the code would look like?