Problem loading parallel datasets even after using SubsetRandomSampler

I have two parallel datasets, dataset1 and dataset2, and below is my code to load them in parallel using SubsetRandomSampler, where I provide train_indices for the DataLoaders.

P.S. Even after setting num_workers=0 and seeding np.random, corresponding samples are not loaded in the same order from both loaders. Any suggestions are heartily welcome, including methods other than SubsetRandomSampler.

import torch, numpy as np
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

train_indices = list(range(len(dataset1)))
np.random.seed(12)
np.random.shuffle(train_indices)
sampler = SubsetRandomSampler(train_indices)

dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)

for i, (data1, data2) in enumerate(zip(dataloader1, dataloader2)):
  x = data1
  y = data2
  print(x, y)

Output:

tensor([5, 1]) tensor([15, 18])
tensor([0, 2]) tensor([14, 12])
tensor([4, 6]) tensor([16, 10])
tensor([8, 9]) tensor([11, 19])
tensor([7, 3]) tensor([17, 13])

Expected Output:

tensor([5, 1]) tensor([15, 11])
tensor([0, 2]) tensor([10, 12])
tensor([4, 6]) tensor([14, 16])
tensor([8, 9]) tensor([18, 19])
tensor([7, 3]) tensor([17, 13])

Since you are using a random sampler, differing index orders are expected: SubsetRandomSampler draws a fresh permutation each time it is iterated, so the two DataLoaders see different orders even though they share the same sampler object.
If you want to yield the same (shuffled) indices from both DataLoaders, create the indices first, and use a custom sampler that replays them in a fixed order:

class MySampler(torch.utils.data.Sampler):
    """Deterministic sampler: yields the given indices in the same order on every iteration."""

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        # Replay the pre-shuffled indices as-is, so every DataLoader
        # using this sampler sees the identical sequence.
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

train_indices = list(range(len(dataset1)))
np.random.seed(12)
np.random.shuffle(train_indices)

sampler = MySampler(train_indices)

dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)

for i, (data1, data2) in enumerate(zip(dataloader1, dataloader2)):
  x = data1
  y = data2
  print(x, y)

Output:

tensor([5, 8]) tensor([15, 18])
tensor([7, 0]) tensor([17, 10])
tensor([4, 9]) tensor([14, 19])
tensor([3, 2]) tensor([13, 12])
tensor([1, 6]) tensor([11, 16])

You could also try passing a generator to the SubsetRandomSampler, but I would recommend creating the indices explicitly, since manipulating the PRNG state often leads to confusion and errors.
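For completeness, a sketch of that generator approach: SubsetRandomSampler accepts a generator argument, and two samplers driven by separately created but identically seeded torch.Generator objects draw the same permutation on each pass (the toy tensors below are the ones from the question; the seed value is arbitrary):

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
indices = list(range(len(dataset1)))

# Two distinct generators with the same seed advance in lockstep,
# so both samplers produce the identical shuffled index sequence.
g1 = torch.Generator().manual_seed(12)
g2 = torch.Generator().manual_seed(12)

loader1 = DataLoader(dataset1, batch_size=2,
                     sampler=SubsetRandomSampler(indices, generator=g1))
loader2 = DataLoader(dataset2, batch_size=2,
                     sampler=SubsetRandomSampler(indices, generator=g2))

for x, y in zip(loader1, loader2):
    print(x, y)
```

Note that both loaders must be iterated in lockstep (as with zip above); consuming one loader alone would advance its generator and desynchronize the pair on later epochs.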
