How to split the dataset when training with torch.distributed

Hi everyone!

I am a beginner in PyTorch. I want to divide a dataset into two parts, a training set and a validation set, when using torch.distributed. I know that on a single GPU I can do this with a sampler:

import torch

# keep the first `split` indices for training (split is defined elsewhere)
indices = list(range(len(train_data)))
train_loader = torch.utils.data.DataLoader(
      train_data, batch_size=args.batch_size,
      sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]),
      pin_memory=True, num_workers=2)

But when I want to train in parallel with torch.distributed, I have to use a different sampler, namely sampler = torch.utils.data.distributed.DistributedSampler(train_data).
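
My understanding of the usual DistributedSampler pattern is roughly the sketch below (train_data and args are placeholders from my setup, and it assumes dist.init_process_group has already been called so the sampler can pick up the rank and world size):

import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called
train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
train_loader = torch.utils.data.DataLoader(
      train_data, batch_size=args.batch_size,
      sampler=train_sampler,            # each process sees its own shard
      pin_memory=True, num_workers=2)

for epoch in range(args.epochs):
    train_sampler.set_epoch(epoch)      # reshuffle the shards every epoch
    for batch in train_loader:
        ...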

So how can I use both samplers, so that I can split the dataset and distribute it at the same time?

Thank you very much for any help!

Yeah, I found a solution with the help of Szymon Maszke: use torch.utils.data.random_split instead. Namely,

# num_train + num_val must equal len(train_data)
train_data, val_data = torch.utils.data.random_split(
      train_data, (num_train, num_val))
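
To also distribute each split across processes, a DistributedSampler can then be wrapped around each of the resulting subsets. A minimal sketch continuing from the snippet above (it assumes dist.init_process_group has already been called, and args.batch_size is a placeholder):

import torch

# one sampler per subset; each process then only sees its own shard of each split
train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
val_sampler = torch.utils.data.distributed.DistributedSampler(val_data, shuffle=False)

train_loader = torch.utils.data.DataLoader(
      train_data, batch_size=args.batch_size,
      sampler=train_sampler, pin_memory=True, num_workers=2)
val_loader = torch.utils.data.DataLoader(
      val_data, batch_size=args.batch_size,
      sampler=val_sampler, pin_memory=True, num_workers=2)

DistributedSampler accepts any Dataset, including the Subset objects returned by random_split, so the split and the sharding compose cleanly.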

cc @vincentqb just in case there are different suggestions.