DataLoader - using SubsetRandomSampler and WeightedRandomSampler at the same time

I'm trying to select 3 of the 4 classes in an image dataset, and also provide class weights for those specific classes. Does WeightedRandomSampler do both of these tasks?

```
import os
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms
# SmallScale is a custom transform defined elsewhere in my code

train_dataset = dset.ImageFolder(os.path.join(dataroot),
                                 transform=transforms.Compose([
                                     SmallScale(image_size),
                                     transforms.RandomCrop(image_size),
                                     transforms.Grayscale(),
                                     transforms.RandomHorizontalFlip(),
                                     transforms.RandomVerticalFlip(),
                                     transforms.ToTensor(),
                                     transforms.Normalize([0.5], [0.5])
                                 ]))

# mark which samples belong to the selected classes
targets = train_dataset.targets
target_values = img_class  # img_class = [1, 2, 3] out of 4 classes
target_bool = []
for x in targets:
    if x in target_values:
        target_bool.append(True)
    else:
        target_bool.append(False)
target_idx = torch.tensor(target_bool).nonzero()

sample_idx = target_idx
sample_tar = torch.Tensor(targets)

# remap the three selected labels to 0, 1, 2 (min -> 0, middle -> 1, max -> 2)
sample_range = sample_tar[sample_idx.squeeze()]
ordered_range = sample_tar[sample_idx.squeeze()]
min_val = sample_range.min()
max_val = sample_range.max()
for val, tar in enumerate(ordered_range):
    if tar == min_val:
        ordered_range[val] = 0
    elif tar == max_val:
        ordered_range[val] = 2
    else:
        ordered_range[val] = 1
target_idx = target_idx[:, 0].tolist()

# inverse class-frequency weight per sample, indexed by the remapped labels 0, 1, 2
class_sample_count = torch.tensor([(sample_range == t).sum() for t in torch.Tensor(img_class)])
weight = 1. / class_sample_count.float()
samples_weight = torch.tensor([weight[t] for t in ordered_range.long()])

balanced_sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weight, len(samples_weight))
# temporal_sampler = torch.utils.data.sampler.SubsetRandomSampler(target_idx)

data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=False,  # (train_sampler is None),
    drop_last=True,
    num_workers=int(workers),
    sampler=balanced_sampler  # but want to use the indices of target_idx instead of the entire dataset
)

# data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
#                                           shuffle=True,
#                                           num_workers=int(workers))
return data_loader
```
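
One way to get both behaviours at once (a sketch, not from the original post; it reuses `train_dataset` and `batch_size` from the snippet above and assumes the kept class ids are 1, 2, 3) is to restrict the data with `torch.utils.data.Subset` and build the `WeightedRandomSampler` over that subset only:

```
import torch
from torch.utils.data import DataLoader, Subset, WeightedRandomSampler

keep_classes = [1, 2, 3]                 # the 3 classes to train on, out of 4
targets = train_dataset.targets          # list of int labels from ImageFolder

# indices of the samples belonging to the kept classes
keep_idx = [i for i, t in enumerate(targets) if t in keep_classes]
subset = Subset(train_dataset, keep_idx)

# inverse-frequency weight per sample, computed only over the subset
subset_targets = [targets[i] for i in keep_idx]
class_count = {c: subset_targets.count(c) for c in keep_classes}
sample_weight = torch.tensor([1.0 / class_count[t] for t in subset_targets], dtype=torch.double)

sampler = WeightedRandomSampler(sample_weight, num_samples=len(sample_weight), replacement=True)
loader = DataLoader(subset, batch_size=batch_size, sampler=sampler, drop_last=True)
```

The sampler's indices then refer to positions inside the `Subset`, so nothing from the fourth class can ever be drawn.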

Hmm… But this does not even make use of the samplers' utility of not having to split the dataset… By having the subset samplers you only need one dataset and don't have to keep track of the indices everywhere… I think the option @pevogam proposed is a bit better, but do we know for sure that setting weight=0 will never let a validation sample enter the training set? Are we sure the weights are not only approximate? It feels a bit risky.

I think your idea is valid and indeed @pevogam's suggestion makes sense if you filter out "invalid samples" by setting their weights to zero.
Yes, setting weights to zero should not sample these, but given the initial question I felt my approach would be more explicit, as the sampler couldn't possibly sample any indices from other dataset splits.
My concern should be invalid, but I'm often leaning more towards guaranteeing no data leak could happen.
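
For reference, the zero-weight variant being discussed would look roughly like this (a sketch with illustrative names, reusing `train_dataset` and `batch_size` from above; not code from the thread): keep the full dataset and give every sample outside the wanted split a weight of 0.

```
from collections import Counter
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

keep_classes = [1, 2, 3]
targets = train_dataset.targets  # list of int labels from ImageFolder

# inverse-frequency weight for the kept classes, 0 for everything else
class_count = Counter(targets)
sample_weight = torch.tensor(
    [1.0 / class_count[t] if t in keep_classes else 0.0 for t in targets],
    dtype=torch.double,
)

# replacement=True matters here; see the experiment further down for what
# happens with replacement=False
sampler = WeightedRandomSampler(sample_weight,
                                num_samples=int((sample_weight > 0).sum()),
                                replacement=True)
loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler, drop_last=True)
```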

I did a little experiment to check that it would "never" generate a sample from those with weight=0 and it seems to work. As you say, I prefer guaranteeing there's no data leak, and this solution makes me doubt a bit, but I feel it's the best I have found without having to split my dataset.

UPDATE: it does not work!

When setting replacement=False, the sampler does use unexpected samples. This is what I did to test it:

```
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# ds is my dataset; each sample is a dict containing an "ids" tensor.
# Only the very last sample gets a non-zero weight.
sampler = WeightedRandomSampler(weights=[0] * (len(ds) - 1) + [1], num_samples=10, replacement=False)
data_loader = DataLoader(ds, batch_size=16, num_workers=1, sampler=sampler)

expected_data = ds[-1]
ok_count = 0
for data in data_loader:
    if torch.equal(data["ids"], expected_data["ids"].repeat(len(data["ids"]), 1)):
        print("ok")
        ok_count += 1
    else:
        print(data["ids"])
        print(expected_data["ids"].repeat(len(data["ids"]), 1))
        raise Exception("Unexpected data")
print("ok: ", ok_count)
```

and got:

```
tensor([[ 1, 15, 16,  ...,  0,  0,  0],
        [ 1, 15,  9,  ...,  0,  0,  0],
        [ 1, 15, 16,  ...,  0,  0,  0],
        ...,
        [ 1, 15,  9,  ...,  0,  0,  0],
        [ 1, 15, 15,  ...,  0,  0,  0],
        [ 1, 15, 15,  ...,  0,  0,  0]])
tensor([[ 1, 15, 16,  ...,  0,  0,  0],
        [ 1, 15, 16,  ...,  0,  0,  0],
        [ 1, 15, 16,  ...,  0,  0,  0],
        ...,
        [ 1, 15, 16,  ...,  0,  0,  0],
        [ 1, 15, 16,  ...,  0,  0,  0],
        [ 1, 15, 16,  ...,  0,  0,  0]])
```

This did not happen when I used replacement=True, but it seems that when you don't allow replacement, the sampler has to draw other samples once the non-zero-weight ones are used up, maybe?
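
That would fit with how the sampler works internally: as far as I can tell it draws all num_samples indices in a single torch.multinomial call, and without replacement those indices must be distinct, so once the non-zero-weight entries are exhausted it can only continue with zero-weight ones. A minimal, dataset-free way to check which indices come out (only the sampler itself, nothing else assumed):

```
from torch.utils.data import WeightedRandomSampler

weights = [0.0] * 9 + [1.0]  # only index 9 has a non-zero weight

with_replacement = list(WeightedRandomSampler(weights, num_samples=10, replacement=True))
print(with_replacement)       # expected: [9, 9, ..., 9]

without_replacement = list(WeightedRandomSampler(weights, num_samples=10, replacement=False))
print(without_replacement)    # may contain zero-weight indices (or raise,
                              # depending on the PyTorch version)
```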


I am thinking of another solution: using SubsetRandomSampler but repeating the minority-class indices so that it oversamples them. This has quite a bad downside though, since it would not benefit from being able to downsample the majority class without discarding samples (I believe WeightedRandomSampler can undersample the majority class in each batch but will probably use most of the samples across multiple batches). It is not the first time I would have benefited from combining samplers; it's a pity there is no sampler that achieves both subset sampling and weighted sampling…
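
A rough sketch of that oversampling idea (illustrative only, reusing `train_dataset` and `batch_size` from above): repeat each kept index in proportion to how under-represented its class is, then hand the list to SubsetRandomSampler, so nothing outside the subset can ever be drawn.

```
from collections import Counter
from torch.utils.data import DataLoader, SubsetRandomSampler

keep_classes = [1, 2, 3]
targets = train_dataset.targets
keep_idx = [i for i, t in enumerate(targets) if t in keep_classes]

class_count = Counter(targets[i] for i in keep_idx)
max_count = max(class_count[c] for c in keep_classes)

# repeat each index so every class contributes roughly max_count entries
oversampled_idx = []
for i in keep_idx:
    oversampled_idx.extend([i] * (max_count // class_count[targets[i]]))

sampler = SubsetRandomSampler(oversampled_idx)
loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler, drop_last=True)
```

Each epoch then shuffles the oversampled index list, so the repeated minority samples appear in every epoch rather than being spread across several.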

EDIT: I'm now thinking about it, and I believe oversampling the minority class with SubsetRandomSampler could end up yielding the same results. It's true that we would not have repeated samples within the same batch with WeightedRandomSampler. However, in order to provide balanced data in all batches, WeightedRandomSampler would need to repeat the minority class at some point. The difference is that it could take multiple epochs for WeightedRandomSampler to yield a repeated sample, whereas with SubsetRandomSampler we would be using oversampled data in every epoch. But epochs are just a way of organizing the training, and as I see it, the effect should not really vary. Does anyone see any problem with this solution?