How to split a sampled data loader? (combine WeightedRandomSampler + SubsetRandomSampler)

taiky · October 4, 2018, 8:55pm

Hi All,

I want to sample a small part of the data from a massive, imbalanced dataset, and then split it into a training part and a validation part.

For the sampling part, I know WeightedRandomSampler is the usual choice.
For the splitting part, I have used SubsetRandomSampler before.

But I don’t know how to sample and then split, because WeightedRandomSampler will return a dataloader, which cannot be passed to SubsetRandomSampler as a sampled dataset.

So, how can I do that?

Thanks!

ptrblck · October 5, 2018, 5:21am

SubsetRandomSampler needs indices, so we have to get the balanced indices that WeightedRandomSampler would produce.
One way would be to create the sampler, collect the indices it yields instead of the data, and store them somewhere. That doesn’t really sound like a good approach, so let’s instead get the indices directly using torch.multinomial.

I assume you already have the sample weights for your dataset (one weight per sample).
This line of code will return len(target) balanced indices:

indices = torch.multinomial(weights, num_samples=len(target), replacement=True)
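If the per-sample weights are not available yet, one common way to build them is to give each sample the inverse frequency of its class. The following is only a minimal sketch under that assumption; the toy `target` tensor (90 samples of class 0, 10 of class 1) is made up for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
target = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])

# Assumed weighting scheme: per-class weight = 1 / class count,
# then every sample looks up the weight of its own class.
class_counts = torch.bincount(target)
class_weights = 1.0 / class_counts.float()
weights = class_weights[target]

# Draw len(target) indices with replacement, proportional to the weights.
indices = torch.multinomial(weights, num_samples=len(target), replacement=True)

# Class counts among the drawn indices; roughly even in expectation.
sampled_counts = torch.bincount(target[indices], minlength=2)
print(sampled_counts)
```

Because the draw uses replacement=True, minority-class samples will typically appear several times in indices.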

Once you have these indices you can split them and feed each part to its own SubsetRandomSampler.
If you want to split them in a stratified manner, you can use sklearn.model_selection.train_test_split on the indices with stratify=target[indices].
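The split and the two samplers could look roughly like this. It is a sketch on a made-up TensorDataset with an assumed 80/20 split and inverse-class-frequency weights; a plain slice is used here, and the stratified train_test_split call mentioned above could replace it.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

torch.manual_seed(0)

# Hypothetical toy dataset: 100 samples, 3 features, imbalanced labels.
data = torch.randn(100, 3)
target = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(data, target)

# Assumed weights: inverse class frequency per sample.
weights = (1.0 / torch.bincount(target).float())[target]
indices = torch.multinomial(weights, num_samples=len(target), replacement=True)

# Plain 80/20 split; the draws are independent, so slicing is already random.
split = int(0.8 * len(indices))
train_idx = indices[:split].tolist()
val_idx = indices[split:].tolist()

# Each sampler shuffles its own index list; duplicates in the list are kept.
train_loader = DataLoader(dataset, batch_size=16,
                          sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=16,
                        sampler=SubsetRandomSampler(val_idx))
```

One thing to keep in mind: since the indices were drawn with replacement, the same sample can end up in both train_idx and val_idx. If a strict separation matters, splitting the dataset first and sampling within the training part avoids that overlap.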

As a small side note: WeightedRandomSampler is a Sampler, not a DataLoader. It does not return a dataloader; it is passed to a DataLoader via the sampler argument.
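To illustrate the side note, here is a small sketch of how the sampler is usually wired into a DataLoader. The toy dataset and the inverse-class-frequency weights are assumptions for the example only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical toy dataset: 100 samples, 3 features, imbalanced labels.
data = torch.randn(100, 3)
target = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(data, target)

# Assumed weights: inverse class frequency per sample.
weights = (1.0 / torch.bincount(target).float())[target]

# The sampler only decides which indices to visit ...
sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                replacement=True)

# ... and the DataLoader uses it to fetch the actual samples.
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

# Iterating the sampler on its own yields plain dataset indices.
drawn = list(sampler)
```

Iterating the sampler directly is also the "return the indices" variant mentioned above, although torch.multinomial gets the same result without creating a sampler.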

taiky · October 5, 2018, 3:21pm

Oh yes, you are right! I should have read the code more carefully.

For images, I found an ugly way to do this:

# Uniform random subset; index the list directly so each (path, label) keeps its int label
sample_idx = torch.randperm(len(dataset))[:sample_size].tolist()
dataset.samples = [dataset.samples[i] for i in sample_idx]

But your way is cleaner and more standard. Thanks!