I want to sample a small part of the data from a massive, imbalanced dataset, and then split it into a training part and a validation part.
For the sample part, I know the WeightedRandomSampler is the best choice.
For the split part, I use SubsetRandomSampler before.
But I don’t know how to sample and then split, because the WeightedRandomSampler will return a dataloader, which cannot be passed to the SubsetRandomSampler as a sampled dataset.
So, how can I do that?
For SubsetRandomSampler you would need to provide the indices, so we have to get the balanced indices from the weighted sampling first.
One way would be to create the WeightedRandomSampler and, instead of returning the data, return the indices and store them somehow. That doesn’t really sound like a good approach, so let’s instead get the indices directly using torch.multinomial.
I assume you already have the sample weights for your dataset.
This line of code will return len(target) balanced indices:
indices = torch.multinomial(weights, num_samples=len(target), replacement=True)
Once you have these indices, you can split them and feed them to the SubsetRandomSamplers.
If you want to split them in a stratified manner, you can use sklearn.model_selection.train_test_split with its stratify argument.
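Putting the pieces together, here is a minimal sketch of the whole workflow. The dataset, class counts, and batch size are made up for illustration; only the torch.multinomial call, the stratified train_test_split, and the SubsetRandomSamplers reflect the approach described above.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset
from sklearn.model_selection import train_test_split

torch.manual_seed(0)

# Hypothetical imbalanced dataset: 900 samples of class 0, 100 of class 1.
target = torch.cat([torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)])
dataset = TensorDataset(torch.randn(len(target), 8), target)

# Per-sample weights: inverse class frequency, so both classes get
# equal total probability mass.
weights = 1.0 / torch.bincount(target).float()[target]

# Draw balanced indices directly (this is what WeightedRandomSampler
# does internally).
indices = torch.multinomial(weights, num_samples=len(target), replacement=True)

# Stratified split of the balanced indices into train/validation.
train_idx, val_idx = train_test_split(
    indices.tolist(), test_size=0.2, stratify=target[indices].tolist())

train_loader = DataLoader(dataset, batch_size=32,
                          sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=32,
                        sampler=SubsetRandomSampler(val_idx))
```

Note that the split is done on the index lists, not on the dataset itself, so both loaders share the same underlying dataset object.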
As a small side note: WeightedRandomSampler is of class Sampler and will be fed to a DataLoader, so it does not return a DataLoader itself.
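To illustrate that point, here is a small sketch (toy dataset and numbers are invented) of how a WeightedRandomSampler is passed to a DataLoader rather than returned by one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy imbalanced dataset: 90 samples of class 0, 10 of class 1.
target = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(torch.randn(len(target), 4), target)

# Inverse-frequency weight per sample.
weights = 1.0 / torch.bincount(target).float()[target]

# The sampler is an argument to the DataLoader; it only yields indices.
sampler = WeightedRandomSampler(weights, num_samples=len(target),
                                replacement=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)
```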
Oh yes, you are right! I should read the code more carefully.
For images, I found an ugly way to do this (it overwrites an ImageFolder's samples list in place, and needs numpy imported as np):

sample_idx = torch.randperm(len(dataset))[:sample_size]
dataset.samples = np.array(dataset.samples)[sample_idx.tolist()].tolist()
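For reference, the same random subsampling can be done non-destructively with torch.utils.data.Subset instead of mutating the dataset's samples list; this sketch uses a stand-in TensorDataset in place of a real image dataset:

```python
import torch
from torch.utils.data import Subset, TensorDataset

torch.manual_seed(0)

# Stand-in for an image dataset (100 items with labels).
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

sample_size = 20
# Random subset of indices without replacement.
sample_idx = torch.randperm(len(dataset))[:sample_size]

# Subset wraps the original dataset rather than overwriting it.
subset = Subset(dataset, sample_idx.tolist())
```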
But your way is cleaner and more standard. Thanks!