Hi, I have a question concerning ImageFolder, which I want to split into a train and a validation dataset and then balance with a WeightedRandomSampler in a DataLoader.
import numpy as np
import torch
from torchvision import datasets
dataset = datasets.ImageFolder(path, transform)
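First I split it with sklearn:
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset
image_datasets = {}
# stratified split over the sample indices so both parts keep the class ratios
train_idx, val_idx = train_test_split(list(range(len(dataset))), stratify=dataset.targets, test_size=0.2)
# wrap the index lists in Subset so that .dataset and .indices are available
image_datasets['train'] = Subset(dataset, train_idx)
image_datasets['val'] = Subset(dataset, val_idx)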
Then I want to sample the train dataset with a WeightedRandomSampler because it is unbalanced.
# get the labels for the subset
labels = np.array(image_datasets['train'].dataset.targets)[image_datasets['train'].indices]
# count the label occurrence
num_class_elements = np.bincount(labels)
# length of new dataset
num_epoch_elements = len(image_datasets['train'])
# create the WeightedRandomSampler
numerator = 1.
denominator = torch.tensor(num_class_elements, dtype=torch.float)  # class counts as a float tensor
# calculate weight per class
class_weights = numerator / denominator
# create vector where index contains class weight
element_weights = [class_weights[class_index] for class_index in labels]
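# create sampler; replacement=True is needed for rebalancing, because drawing
# len(train) samples without replacement returns every element exactly once
sampler = torch.utils.data.sampler.WeightedRandomSampler(element_weights, num_epoch_elements, replacement=True)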
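To check my weight calculation, here is the same arithmetic on a toy example without PyTorch. The labels are made up for illustration; the point is only that inverse-frequency weights give every class the same total weight, which is what should make the sampler draw the classes equally often.

```python
from collections import Counter

# Hypothetical toy labels for an unbalanced training subset:
# six samples of class 0 and two samples of class 1.
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Count how often each class occurs.
class_counts = Counter(labels)

# The weight of a class is the inverse of its frequency.
class_weights = {cls: 1.0 / count for cls, count in class_counts.items()}

# Each sample gets the weight of its class.
element_weights = [class_weights[label] for label in labels]

# Sum the sample weights per class; equal totals mean that each class
# has the same overall probability of being drawn.
total_per_class = {
    cls: sum(w for w, label in zip(element_weights, labels) if label == cls)
    for cls in class_counts
}
print(total_per_class)
```

Both totals come out (up to floating point rounding) the same, so a single sample of the rare class is three times as likely to be drawn as a single sample of the common class.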
Is it now safe to use the Subset combined with the sampler for the DataLoader?
train_dl = torch.utils.data.DataLoader(image_datasets['train'], batch_size=18, num_workers=2,
pin_memory=True, drop_last=False, shuffle=False, sampler=sampler)
Question here:
My concern is as follows. Maybe I can explain it with an example: the original dataset has 100 elements with indices [0:99], and the training dataset has 80 of them, say indices [20:99]. The sampler will return indices from [0:79], which the Subset needs to translate to its parent dataset's indices [0:99], excluding [0:19].
So is the Subset handling this or not?
Is the Subset mapping the indices from the sampler to the indices of its parent dataset?
The sampler says 0, which corresponds to train_ds[0], which calls train_ds.dataset[train_ds.indices[0]] #pseudocode
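To make the behaviour I am hoping for concrete, here is a small sketch with a simplified stand-in class. TinySubset is a hypothetical name of mine and not the actual PyTorch class; it only models the assumption that a subset looks up dataset[indices[idx]] when indexed.

```python
class TinySubset:
    """Simplified stand-in for torch.utils.data.Subset (illustration only)."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        # Translate the subset-local index into the parent dataset's index.
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)


# Parent "dataset" with 100 elements; each element is simply its own index.
parent = list(range(100))

# The training subset holds parent indices 20..99, i.e. 80 elements.
train_ds = TinySubset(parent, list(range(20, 100)))

# A sampler built for the subset yields indices in 0..79.
first = train_ds[0]   # looks up parent[20]
last = train_ds[79]   # looks up parent[99]
print(len(train_ds), first, last)
```

If the real Subset works like this, then the sampler only ever needs to know the length of the subset, and the weights have to be given in subset order (which is how I built element_weights above).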