Pass indices to `WeightedRandomSampler()`?

Hi!

I’m working on a project where it’s advantageous to use a sampler that is fed a list of indices (as is the case with SubsetRandomSampler()). However, WeightedRandomSampler() draws from range(len(weights)) instead, so my dataloader ends up trying to load indices that I don’t want.

I.e. I want indices = [0, 2, 5, 6] and I build my weights using the infamous make_weights_for_balanced_classes method, but len(weights) would prompt the loader to look for samples 0, 1, 2, 3 instead.
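
To make the mismatch concrete, here is a minimal sketch of the two samplers side by side (the weight values are just placeholders):

from torch.utils.data import SubsetRandomSampler, WeightedRandomSampler

indices = [0, 2, 5, 6]

# SubsetRandomSampler draws only from the indices I pass
subset_sampler = SubsetRandomSampler(indices)
print(list(subset_sampler))    # e.g. [5, 0, 6, 2]

# WeightedRandomSampler instead draws from range(len(weights)),
# so with 4 weights it yields indices 0..3 rather than my list above
weights = [0.25, 0.25, 0.25, 0.25]
weighted_sampler = WeightedRandomSampler(weights, num_samples=4)
print(list(weighted_sampler))  # e.g. [1, 3, 0, 3]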

Question: Can I achieve what I need without messing about with the source? If so, any hints are greatly appreciated. Hope all’s clear.

I think setting the weights to 0 for the unwanted samples should work, since those samples would then never be drawn. Something like this:

import torch

# Create dummy data with class imbalance 99 to 1
numDataPoints = 1000
data_dim = 1
bs = 100
data = torch.arange(numDataPoints*data_dim).view(numDataPoints, data_dim)
target = torch.cat((torch.zeros(int(numDataPoints * 0.99), dtype=torch.long),
                    torch.ones(int(numDataPoints * 0.01), dtype=torch.long)))

print('target train 0/1: {}/{}'.format(
    (target == 0).sum(), (target == 1).sum()))

# mask unwanted samples to calculate the weights without the removed samples
unwanted = torch.randint(0, numDataPoints, (100,))
mask = torch.ones(numDataPoints, dtype=torch.bool)
mask[unwanted] = False
target_masked = target[mask]

# Compute samples weight (each sample should get its own weight)
class_sample_count = torch.tensor(
    [(target_masked == t).sum() for t in torch.unique(target_masked, sorted=True)])
weight = 1. / class_sample_count.float()
samples_weight = torch.tensor([weight[t] for t in target])
samples_weight[~mask] = 0.  # zero weight -> unwanted samples are never drawn

# Create sampler, dataset, loader
sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weight, len(samples_weight))
train_dataset = torch.utils.data.TensorDataset(data, target)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=bs, sampler=sampler)

# Iterate DataLoader and check class balance for each batch
for i, (x, y) in enumerate(train_loader):
    print("batch index {}, 0/1: {}/{}".format(
        i, (y == 0).sum(), (y == 1).sum()))

Note that I’ve kept len(samples_weight), but you might want to subtract the number of masked samples from it.
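
For example, building on the variables above:

# draw only as many samples per epoch as remain after masking
num_valid = int(mask.sum())
sampler = torch.utils.data.WeightedRandomSampler(samples_weight, num_valid)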

Cheers Piotr! The only issue is that the data themselves don’t exist: my Dataset’s __getitem__ looks up samples by index, and my samples folder only contains the indices I pass.

If you think that solving it this way would inevitably be papering over the cracks and that it’s better to redesign my pipeline, do let me know and I can go back to the drawing board; otherwise, any suggestions are welcome :)
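
For context, my Dataset looks roughly like this hypothetical sketch (the folder layout and names are made up):

import torch
from torch.utils.data import Dataset

class FolderDataset(Dataset):
    """Hypothetical sketch: only files for the passed indices exist on disk."""
    def __init__(self, root, indices, targets):
        self.root = root
        self.indices = indices                      # e.g. [0, 2, 5, 6]
        self.targets = dict(zip(indices, targets))  # label per original index

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, index):
        # samples are looked up by their original index, so a file like
        # f"{self.root}/{index}.pt" only exists for the indices I passed in;
        # a sampler yielding 0..len(weights)-1 can therefore request
        # files that were never created
        sample = torch.load(f"{self.root}/{index}.pt")
        return sample, self.targets[index]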

I think your lazy loading approach is fine to use in your actual training pipeline.
However, you would need to grab all target values once to set up the sampler.
If you’ve stored the weights, you could then just load them in your actual training run and stick to your Dataset implementation. Let me know if this would work.
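
Something along these lines is what I have in mind (just a sketch; the TensorDataset stand-in and the file name samples_weight.pt are placeholders for your own setup):

import torch
from torch.utils.data import TensorDataset, WeightedRandomSampler

# Stand-in for your real (lazy) Dataset, with dummy data for the sketch
data = torch.randn(10, 1)
targets = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
dataset = TensorDataset(data, targets)

# One-off pass: collect all targets, compute per-sample weights, store them
all_targets = torch.stack([dataset[i][1] for i in range(len(dataset))])
class_count = torch.bincount(all_targets).float()
samples_weight = (1.0 / class_count)[all_targets]
torch.save(samples_weight, 'samples_weight.pt')

# In the actual training script: load the stored weights and build the sampler,
# keeping the original Dataset implementation untouched
samples_weight = torch.load('samples_weight.pt')
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))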

Very smart idea, just set the weights of unwanted samples to zero. Thanks