After going through a lot of blog posts and answers on this discuss forum, I think I've finally got the hang of it. Let me know if I got the following implementations of train + val set creation using random_split and SubsetRandomSampler right. And finally, I'd like to discuss the challenge of doing this with WeightedRandomSampler.
The data is read using ImageFolder. The task is binary image classification with 498 images in the dataset, equally distributed between the two classes (249 images each).
img_dataset = ImageFolder(..., transform=t)
1. SubsetRandomSampler
import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

# Shuffle all dataset indices, then carve off the first 20% for validation.
dataset_size = len(img_dataset)
dataset_indices = list(range(dataset_size))
np.random.shuffle(dataset_indices)

val_split_index = int(np.floor(0.2 * dataset_size))
train_idx, val_idx = dataset_indices[val_split_index:], dataset_indices[:val_split_index]

# Each sampler restricts its loader to one subset of indices. shuffle must stay
# False: a DataLoader cannot take both a sampler and shuffle=True.
train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)

train_loader = DataLoader(dataset=img_dataset, shuffle=False, batch_size=8, sampler=train_sampler)
validation_loader = DataLoader(dataset=img_dataset, shuffle=False, batch_size=1, sampler=val_sampler)
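As a quick sanity check (my own addition, not strictly needed), the two index lists should be disjoint and sized 399/99 for the 498-image dataset:

assert set(train_idx).isdisjoint(val_idx)
print(len(train_idx), len(val_idx))  # -> 399 99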
2. random_split
Here, out of the 498 total images, 400 are randomly assigned to train and the remaining 98 to validation.
from torch.utils.data import DataLoader, random_split

dataset_train, dataset_valid = random_split(img_dataset, (400, 98))

train_loader = DataLoader(dataset=dataset_train, shuffle=True, batch_size=8)
val_loader = DataLoader(dataset=dataset_valid, shuffle=False, batch_size=1)
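Side note: random_split draws a fresh assignment on every call. If the split needs to be reproducible, recent PyTorch versions accept a generator argument; a minimal sketch:

dataset_train, dataset_valid = random_split(
    img_dataset, (400, 98),
    generator=torch.Generator().manual_seed(42)  # fixed seed -> same split every run
)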
3. WeightedRandomSampler
If someone stumbled here searching for WeightedRandomSampler, check @ptrblck's answer here for a good explanation of what is happening below.
Now, how does WeightedRandomSampler fit into creating a train + val set? Unlike SubsetRandomSampler or random_split(), it doesn't split the data into train and val; it only ensures that each training batch draws, on average, an equal number of samples from each class. So my guess is that we need to apply WeightedRandomSampler after random_split() or SubsetRandomSampler. But that alone wouldn't ensure that train and val have a similar ratio between classes.
import torch
from torch.utils.data import WeightedRandomSampler

target_list = []
for _, t in img_dataset:
    target_list.append(t)
target_list = torch.tensor(target_list)
# Note: do NOT shuffle target_list here. weights[i] has to correspond to the
# i-th sample of the dataset, so the original order must be preserved.

# get_class_distribution() is a function that takes in a dataset and
# returns a dictionary with class counts. In this case,
# get_class_distribution(img_dataset) returns the following -
# {'class_0': 249, 'class_1': 249}
class_count = [i for i in get_class_distribution(img_dataset).values()]
class_weights = 1./torch.tensor(class_count, dtype=torch.float)

# Per-sample weight: each sample gets the inverse frequency of its class.
class_weights_all = class_weights[target_list]

weighted_sampler = WeightedRandomSampler(
    weights=class_weights_all,
    num_samples=len(class_weights_all),
    replacement=True
)
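For completeness, the sampler then takes the place of shuffle=True in the DataLoader (the two are mutually exclusive), as in the answer linked above:

train_loader = DataLoader(dataset=img_dataset, batch_size=8, sampler=weighted_sampler)

But note this draws from the whole dataset, which is exactly the problem: the weights are defined over all 498 images, not over a train subset.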
So, how do we ensure that our train and val sets, as well as the batches produced by train_loader and val_loader, have an equal distribution of the output classes?
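My current best attempt at answering this, posted for feedback (a sketch, untested; it assumes the labels are available via img_dataset.targets, which torchvision's ImageFolder provides): stratify the index split per class so train and val both keep the 50/50 ratio, then build the weighted sampler from the train targets only, keeping the weights index-aligned with the Subset.

import torch
from torch.utils.data import DataLoader, Subset, WeightedRandomSampler

targets = torch.tensor(img_dataset.targets)  # ImageFolder stores labels in .targets

# Stratified split: take 20% of *each* class for validation.
train_idx, val_idx = [], []
for c in torch.unique(targets):
    c_idx = torch.nonzero(targets == c).flatten()
    c_idx = c_idx[torch.randperm(len(c_idx))]  # shuffle within the class
    n_val = int(0.2 * len(c_idx))
    val_idx += c_idx[:n_val].tolist()
    train_idx += c_idx[n_val:].tolist()

# Weights computed from, and aligned with, the train subset only.
train_targets = targets[train_idx]
class_count = torch.bincount(train_targets)
sample_weights = (1. / class_count.float())[train_targets]

weighted_sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True
)

train_loader = DataLoader(Subset(img_dataset, train_idx), batch_size=8, sampler=weighted_sampler)
val_loader = DataLoader(Subset(img_dataset, val_idx), batch_size=1, shuffle=False)

Since the sampler here indexes into the Subset, sample_weights[i] corresponds to the i-th train sample, which I believe addresses the alignment issue. Corrections welcome.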