Unequal sampling from two different datasets

Hi all!

I have two datasets: dataset 1 has roughly 400 samples and dataset 2 roughly 1000. I want to train my network so that it sees one sample from dataset 1 for every 5 samples from dataset 2 (approximately).

With a batch size of 32, each batch containing 5 samples from dataset 1 and 27 from dataset 2 would work. How would I go about setting up a DataLoader to achieve this? I definitely want to train on all 1000 samples from dataset 2 every epoch, and just randomly sample however much I need from dataset 1 to achieve that distribution.

I looked at ConcatDataset but this just combines the two into one large dataset and samples from that. I saw DataLoader has a sampler parameter; could I achieve what I want with a WeightedRandomSampler?

Any help is appreciated!

One approach would be to create two separate DataLoaders, one with batch_size=5 for the small dataset and one with batch_size=27 for the large dataset.
In the outer loop you would iterate the large DataLoader, which guarantees you see every sample from the large dataset each epoch.
Before entering the loop you could create an iterator for the small dataset via small_iter = iter(small_loader), then call next(small_iter) inside the loop to draw the matching small batch (restarting the iterator if it runs out).
Once you have both batches, you could concatenate them via data = torch.cat((small, large), dim=0) to create your final batch of 32 samples.
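
Here is a minimal sketch of that loop. The TensorDatasets (small_dataset, large_dataset) are hypothetical stand-ins for your real datasets, just to make it runnable:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets: ~400 samples (small) and ~1000 samples (large).
small_dataset = TensorDataset(torch.randn(400, 10))
large_dataset = TensorDataset(torch.randn(1000, 10))

# 5 + 27 = 32 samples per combined batch.
small_loader = DataLoader(small_dataset, batch_size=5, shuffle=True)
large_loader = DataLoader(large_dataset, batch_size=27, shuffle=True)

small_iter = iter(small_loader)
for (large,) in large_loader:  # outer loop sees every large-dataset sample
    try:
        (small,) = next(small_iter)
    except StopIteration:
        # Restart the small loader once exhausted so it keeps supplying batches.
        small_iter = iter(small_loader)
        (small,) = next(small_iter)
    data = torch.cat((small, large), dim=0)  # combined batch of (up to) 32 samples
    # ... forward / backward pass on `data` ...
```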

I think the WeightedRandomSampler wouldn't strictly work here, since it draws with replacement and therefore gives no guarantee that every sample from the large dataset is seen each epoch.
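
For completeness, this is roughly what the WeightedRandomSampler approach would look like, and where it falls short. The weights are a hypothetical choice matching the 5:27 batch split, and small_dataset / large_dataset are the placeholders from the sketch above:

```python
from torch.utils.data import ConcatDataset, WeightedRandomSampler

n_small, n_large = len(small_dataset), len(large_dataset)
combined = ConcatDataset([small_dataset, large_dataset])

# Weight each sample so a draw lands in the small dataset with probability 5/32
# and in the large dataset with probability 27/32.
weights = torch.cat((
    torch.full((n_small,), (5 / 32) / n_small),
    torch.full((n_large,), (27 / 32) / n_large),
))
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=32, sampler=sampler)
# Because sampling is with replacement, some large-dataset samples may be
# skipped in any given epoch, so full coverage is not guaranteed.
```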


Thanks! Smart solution.