Using WeightedRandomSampler with ConcatDataset

Hi 🙂
I have a rather small image dataset and want to augment my training images.
However, I want the training dataloader to use unaugmented as well as augmented images.
For that I am using the ConcatDataset class. I also want to use WeightedRandomSampler, because some classes have more images than others; the number of images per class is [602, 536, 1088, 751].

import torch
from torchvision import datasets

datasetBasic = datasets.ImageFolder('path', transform=transformsBasic)
datasetAugmented = datasets.ImageFolder('path', transform=transformsAugment)
concatDataset = torch.utils.data.ConcatDataset((datasetBasic, datasetAugmented))

dataloaders_dict = {"train": torch.utils.data.DataLoader(concatDataset, batch_size=batch_size, sampler=sampler, num_workers=0, pin_memory=True),
                    "val": torch.utils.data.DataLoader(datasetBasic, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True)}

However, I do not know how to use WeightedRandomSampler together with ConcatDataset.
Does anyone have a suggestion for how to solve this?

This post is very similar to Sampling from a concatenated dataset, but there were no more replies on that one, so I am asking again ^^’.

PS: Great forums, the people seem to be really active here ^^

For weighted sampling you would have to create a weight for each sample.
If you don’t have the target tensors computed already, you could iterate your dataset once and store them.
Here is a small example, which should match your use case:

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Create dummy data with a class imbalance of 99 to 1
numDataPoints = 1000
data_dim = 5
bs = 100
data = torch.randn(numDataPoints, data_dim)
target = torch.cat((torch.zeros(int(numDataPoints * 0.99), dtype=torch.long),
                    torch.ones(int(numDataPoints * 0.01), dtype=torch.long)))

print('target train 0/1: {}/{}'.format(
    (target == 0).sum(), (target == 1).sum()))

# Create ConcatDataset
dataset = TensorDataset(data, target)
train_dataset = ConcatDataset((dataset, dataset))

# Get all targets
targets = []
for _, target in train_dataset:
    targets.append(target)
targets = torch.stack(targets)

# Compute samples weight (each sample should get its own weight)
class_sample_count = torch.tensor(
    [(targets == t).sum() for t in torch.unique(targets, sorted=True)])
weight = 1. / class_sample_count.float()
samples_weight = torch.tensor([weight[t] for t in targets])

# Create sampler, dataset, loader
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

# Iterate DataLoader and check class balance for each batch
for i, (x, y) in enumerate(train_loader):
    print("batch index {}, 0/1: {}/{}".format(
        i, (y == 0).sum(), (y == 1).sum()))

In the first part I’m creating a dummy imbalanced dataset.
You should of course just skip this step and use your original concatDataset.

After storing all targets, the class_sample_count and the corresponding samples_weight tensors are created; the latter is used to create the WeightedRandomSampler.
As you can see in the last loop, each batch should be approximately balanced by the sampler.
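
In your concrete case you could even skip the loop over the dataset: ImageFolder stores the class index of every sample in its .targets attribute, so the weights can be built directly from those lists. Here is a minimal sketch, assuming both of your ImageFolders point at the same directory (so their targets line up):

import torch

# ConcatDataset chains the datasets in order, so the targets can be chained too
targets = torch.tensor(datasetBasic.targets + datasetAugmented.targets)

# Inverse class frequency as the per-class weight, then one weight per sample
class_sample_count = torch.bincount(targets)
weight = 1. / class_sample_count.float()
samples_weight = weight[targets]

sampler = torch.utils.data.WeightedRandomSampler(samples_weight, len(samples_weight))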

Let me know if that would work for you.

First up, thank you for the quick and detailed response!

I have tried executing the code at # Get all targets, but ran into the following error:

TypeError: expected Tensor as element 0 in argument 0, but got int

which appears at the following line:

targets = torch.stack(targets)

Looking at targets, I noticed that in my case it is full of ints, whereas torch.stack expects tensors, if I understand it correctly.
Is this problematic, or should I just use different functions to compute the sample weights?

In that case you could use torch.tensor instead.
Let me know if that works.
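
For reference, the fix is just to replace the torch.stack call with the line below (this works because targets here is a plain Python list of ints, which is what ImageFolder returns):

targets = torch.tensor(targets)  # torch.tensor accepts a list of Python ints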

Yes, that worked!
Thanks again for the help.