Using ConcatDataset with WeightedRandomSampler

I am using a ConcatDataset with a WeightedRandomSampler like this:

import numpy as np
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler
from torchvision import transforms

training_sets = data_augment(training_set)
self.train_dataset = ConcatDataset([MyDataset(
    features=self.features, 
    transform=transforms.Compose([ToTensor()])
) for _set in training_sets])

num_samples = len(self.train_dataset)
weights = np.linspace(1, 3, num_samples)
sampler = WeightedRandomSampler(weights, num_samples)
self.train_loader = DataLoader(
    self.train_dataset, sampler=sampler, num_workers=7, batch_size=self.batch_size
)

The idea behind using the WeightedRandomSampler is to train more often on recent inputs (I want to see whether that improves generalization).
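For context on why a ramp does this: WeightedRandomSampler (with its default replacement=True) draws index i with probability proportional to weights[i], so a linspace from 1 to 3 makes the newest input three times as likely to be drawn as the oldest. A tiny sketch with a made-up size:

import numpy as np

w = np.linspace(1, 3, 10)   # hypothetical: 10 samples ordered oldest -> newest
p = w / w.sum()             # sampling probability of each index
print(p[-1] / p[0])         # 3.0: the newest input is 3x as likely as the oldest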

However, because I am using a ConcatDataset, I will end up sampling more often from the training sets concatenated at the end, rather than from the inputs at the end of each set. Is that right?
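A quick numeric check with made-up sizes (two concatenated sets of four samples each) shows the skew:

import numpy as np

w = np.linspace(1, 3, 8)  # one global ramp over two concatenated sets of 4
print(w[:4])  # first set:  ~[1.00, 1.29, 1.57, 1.86]
print(w[4:])  # second set: ~[2.14, 2.43, 2.71, 3.00]
# Every sample of the second set outweighs every sample of the first,
# so whole later datasets are favored, not the recent inputs within each set.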

If that’s the case, then the fix would be to build one np.linspace() per dataset and concatenate the ramps in the same order that ConcatDataset concatenates my training sets.

Thoughts?

I’ve changed my implementation to this:

training_sets = data_augment(training_set)

# Storing the datasets before they are concatenated,
# because we need the length of each one
self.train_datasets = [MyDataset(
    features=self.features, 
    transform=transforms.Compose([ToTensor()])
) for _set in training_sets]
self.train_dataset = ConcatDataset(self.train_datasets)

# Creating a weight ramp for each constituent dataset,
# in the same order ConcatDataset concatenates them
weights = []
for d in self.train_datasets:
    dataset_len = len(d)
    weights.append(np.linspace(1, 3, dataset_len))
weights = np.concatenate(weights)

num_samples = len(self.train_dataset)
sampler = WeightedRandomSampler(weights, num_samples)
self.train_loader = DataLoader(
    self.train_dataset, sampler=sampler, num_workers=7, batch_size=self.batch_size
)

I’ll let you know if it works better after my training run.
I am too lazy to debug this with synthetic data.
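For anyone who does want to check it, a self-contained sanity test of just the sampler could look like this (toy sizes, with TensorDataset standing in for MyDataset):

from collections import Counter

import numpy as np
import torch
from torch.utils.data import ConcatDataset, TensorDataset, WeightedRandomSampler

# Two toy datasets standing in for the augmented training sets
sets = [TensorDataset(torch.arange(5)), TensorDataset(torch.arange(8))]
concat = ConcatDataset(sets)

# One ramp per dataset, concatenated in the same order as ConcatDataset
weights = np.concatenate([np.linspace(1, 3, len(d)) for d in sets])

sampler = WeightedRandomSampler(weights, num_samples=100_000)
counts = Counter(sampler)
for i in range(len(concat)):
    print(i, counts[i])
# The counts should ramp up within indices 0-4 (first set) and ramp up
# again within indices 5-12 (second set), confirming that the tail of
# each constituent set is drawn more often.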