ImageFolder with Subset and WeightedRandomSampler

BoBTB · November 19, 2020, 3:13pm

Hi, I have a question concerning ImageFolder: I want to split it into a train and a validation dataset and balance the train set with a WeightedRandomSampler in a DataLoader.

from torchvision import datasets
dataset = datasets.ImageFolder(path, transform)

First I split it with sklearn

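from sklearn.model_selection import train_test_split
from torch.utils.data import Subset
image_datasets = {}
# stratified split of the sample indices, so both parts keep the class ratio
train_idx, val_idx = train_test_split(list(range(len(dataset))), stratify=dataset.targets, test_size=0.2)
# wrap the index lists in Subsets of the original ImageFolder
image_datasets['train'] = Subset(dataset, train_idx)
image_datasets['val'] = Subset(dataset, val_idx)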

Then I want to sample the train dataset with a WeightedRandomSampler because it is unbalanced.

import numpy as np
import torch

# get the labels for the subset
labels = np.array(image_datasets['train'].dataset.targets)[image_datasets['train'].indices]
# count the label occurrence
num_class_elements = np.bincount(labels)
# length of new dataset
num_epoch_elements = len(image_datasets['train'])

Then I create the WeightedRandomSampler:

numerator = 1.
# class counts as a float tensor
denominator = torch.tensor(num_class_elements, dtype=torch.float)

# calculate weight per class
class_weights = numerator / denominator
# create vector where index contains class weight
element_weights = [class_weights[class_index] for class_index in labels]
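# create sampler; replacement=True is needed for balancing, because drawing
# len(train) samples without replacement only yields a permutation of the subset
sampler = torch.utils.data.sampler.WeightedRandomSampler(element_weights, num_epoch_elements, replacement=True)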
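
To check the weighting idea in isolation, here is a minimal pure-Python sketch. It uses `random.choices` as a stand-in for a WeightedRandomSampler; the toy `labels` list and the helper name `inverse_frequency_weights` are assumptions made up for illustration and are not part of the code above.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per element: 1 / (number of elements in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy, unbalanced label list (hypothetical data): 90 x class 0, 10 x class 1.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)

# Each class carries the same total weight, so both are equally likely per draw.
weight_per_class = Counter()
for label, weight in zip(labels, weights):
    weight_per_class[label] += weight

# Drawing WITH replacement: the drawn epoch is roughly class-balanced.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=len(labels))
drawn_counts = Counter(labels[i] for i in drawn)

# Drawing WITHOUT replacement, as many samples as there are elements,
# can only return every index exactly once: the class counts stay 90/10.
remaining = list(range(len(labels)))
no_replacement = []
while remaining:
    pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
    remaining.remove(pick)
    no_replacement.append(pick)
no_replacement_counts = Counter(labels[i] for i in no_replacement)
```

The second half suggests why the `replacement` flag matters when the number of samples equals the subset length: without replacement the weights only change the order of the elements, not how often each class appears.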

Is it now safe to use the Subset combined with the sampler for the DataLoader?

train_dl = torch.utils.data.DataLoader(image_datasets['train'], batch_size=18, num_workers=2,
                                       pin_memory=True, drop_last=False, shuffle=False, sampler=sampler)

Question here:
My concern is as follows. Maybe I can explain it with an example: the original dataset has 100 elements with indices [0:99], and the training dataset has 80 of them, say the parent indices [20:99]. The sampler will return indices from [0:79], which the Subset needs to translate to its parent dataset's indices [20:99], i.e. excluding [0:19].
So is the Subset handling this or not?
Is the Subset mapping the indices from the sampler to the indices of its parent dataset?

My understanding: the sampler says 0, which corresponds to train_ds[0], which calls train_ds.dataset[train_ds.indices[0]] #pseudocode
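
The mapping I have in mind can be illustrated with a minimal stand-in for Subset. As far as I understand, `torch.utils.data.Subset.__getitem__` does essentially this lookup, but the class `SubsetSketch` below is a hypothetical simplification written for this example, not the PyTorch source.

```python
class SubsetSketch:
    """Simplified stand-in for torch.utils.data.Subset (illustration only)."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        # The sampler's position idx is translated to a parent-dataset index.
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)


# Parent "dataset" with 100 elements; element i is simply the string "img_i".
parent = [f"img_{i}" for i in range(100)]

# The training subset holds the parent indices 20..99 (80 elements).
train_ds = SubsetSketch(parent, list(range(20, 100)))

# A sampler only ever produces positions in range(len(train_ds)), i.e. 0..79.
first = train_ds[0]   # looks up parent[20]
last = train_ds[79]   # looks up parent[99]
```

If the real Subset behaves like this sketch, the sampler never needs to know the parent indices: it only has to stay within `range(len(train_ds))`, which is the case when the weights list has one entry per subset element.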