What is the difference between creating a validation set using random_split as opposed to SubsetRandomSampler?

After going through a lot of blog posts and answers on these forums, I think I've finally got the hang of it.

Let me know if I got the following implementations of train + val set creation using random_split and SubsetRandomSampler right.

And finally, I'd like to discuss the challenge of doing this with WeightedRandomSampler.

The data is read using ImageFolder. The task is binary image classification, with 498 images in the dataset, equally distributed between the two classes (249 images each).

img_dataset = ImageFolder(..., transform=t)
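where t is whatever transform pipeline you use — a minimal placeholder version, just so the snippet is self-contained (the resize size here is arbitrary):

from torchvision import transforms

# Placeholder transform pipeline; swap in your own.
t = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])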

1. SubsetRandomSampler

import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

dataset_size = len(img_dataset)
dataset_indices = list(range(dataset_size))

# Shuffle the indices in place so the split is random.
np.random.shuffle(dataset_indices)

# 20% of the dataset (99 of the 498 images) goes to validation.
val_split_index = int(np.floor(0.2 * dataset_size))

train_idx, val_idx = dataset_indices[val_split_index:], dataset_indices[:val_split_index]

train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)

# shuffle stays False: the sampler and shuffle options are mutually exclusive.
train_loader = DataLoader(dataset=img_dataset, shuffle=False, batch_size=8, sampler=train_sampler)
validation_loader = DataLoader(dataset=img_dataset, shuffle=False, batch_size=1, sampler=val_sampler)
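As a side note, SubsetRandomSampler re-shuffles its indices every epoch, which isn't needed for validation. If you'd rather iterate the validation set in a fixed order, one option (a minimal sketch using torch.utils.data.Subset) is:

from torch.utils.data import DataLoader, Subset

# Subset pins the validation indices; the DataLoader then iterates
# them sequentially (shuffle defaults to False).
val_subset = Subset(img_dataset, val_idx)
validation_loader = DataLoader(dataset=val_subset, batch_size=1)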

2. random_split

Here, out of the 498 total images, 400 are randomly assigned to train and the remaining 98 to validation.

from torch.utils.data import random_split

dataset_train, dataset_valid = random_split(img_dataset, (400, 98))

train_loader = DataLoader(dataset=dataset_train, shuffle=True, batch_size=8)
val_loader = DataLoader(dataset=dataset_valid, shuffle=False, batch_size=1)
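If you want this split to be reproducible across runs, random_split also accepts a generator argument in recent PyTorch versions (a sketch; the seed 42 is arbitrary):

import torch
from torch.utils.data import random_split

# Fixing the generator seed makes the 400/98 assignment deterministic.
dataset_train, dataset_valid = random_split(
    img_dataset, (400, 98), generator=torch.Generator().manual_seed(42)
)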

3. WeightedRandomSampler

If you stumbled here searching for WeightedRandomSampler, check @ptrblck's answer here for a good explanation of what is happening below.

Now, how does WeightedRandomSampler fit into creating a train + val set? Unlike SubsetRandomSampler or random_split(), we're not splitting into train and val here; we're simply ensuring that each batch gets a roughly equal number of samples from each class during training.

So, my guess is that we need to use WeightedRandomSampler after random_split() or SubsetRandomSampler. But that alone wouldn't ensure that train and val have a similar class ratio (see the sketch below).
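One way to get that similar ratio is a stratified index split before building the samplers (a minimal sketch, assuming scikit-learn is available; train_test_split's stratify argument keeps the 50/50 class ratio in both index sets):

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, SubsetRandomSampler

# ImageFolder stores one integer label per sample in `targets`.
targets = img_dataset.targets

# Stratified 80/20 split: both index sets keep the class ratio.
train_idx, val_idx = train_test_split(
    list(range(len(targets))), test_size=0.2, stratify=targets
)

train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)

train_loader = DataLoader(img_dataset, batch_size=8, sampler=train_sampler)
val_loader = DataLoader(img_dataset, batch_size=1, sampler=val_sampler)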

import torch
from torch.utils.data import WeightedRandomSampler

# Collect the label of every sample, in dataset order. The order matters:
# WeightedRandomSampler treats weights[i] as the weight of sample i, so the
# targets must not be shuffled before the weights are looked up.
target_list = []

for _, t in img_dataset:
    target_list.append(t)

target_list = torch.tensor(target_list)

# get_class_distribution() is a function that takes in a dataset and
# returns a dictionary with the per-class count. In this case,
# get_class_distribution(img_dataset) returns the following -
# {'class_0': 249, 'class_1': 249}
class_count = list(get_class_distribution(img_dataset).values())
class_weights = 1. / torch.tensor(class_count, dtype=torch.float)

# One weight per sample: look up each sample's class weight by its label.
class_weights_all = class_weights[target_list]

weighted_sampler = WeightedRandomSampler(
    weights=class_weights_all,
    num_samples=len(class_weights_all),
    replacement=True
)
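The sampler then goes into the DataLoader in place of shuffle=True:

from torch.utils.data import DataLoader

# With replacement=True, each epoch draws len(class_weights_all) samples,
# balanced across classes in expectation (not exactly per batch).
train_loader = DataLoader(dataset=img_dataset, batch_size=8, sampler=weighted_sampler)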

So, how do we ensure that our train and val sets, as well as the batches in train_loader and val_loader, have an equal distribution of output classes?
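My current guess at combining the two: do the stratified split first, then compute the per-sample weights only over the training indices, so the sampler's positions line up with the Subset (again just a sketch, reusing train_idx and class_weights from above):

import torch
from torch.utils.data import DataLoader, Subset, WeightedRandomSampler

# Per-sample weights for the training subset only; position i of
# train_weights corresponds to position i of train_subset.
train_targets = torch.tensor(img_dataset.targets)[train_idx]
train_weights = class_weights[train_targets]

train_subset = Subset(img_dataset, train_idx)
train_sampler = WeightedRandomSampler(
    weights=train_weights,
    num_samples=len(train_weights),
    replacement=True,
)

train_loader = DataLoader(train_subset, batch_size=8, sampler=train_sampler)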
