DataLoader returning the same index for every sample in a batch

I’m building a dataloader, which I wrote as follows:

train_sampler = RandomSampler(train_dataset, num_samples=100)
valid_sampler = RandomSampler(valid_dataset, num_samples=50)

train_data_loader = DataLoader(
    train_dataset,
    shuffle=False,
    batch_size=self.train_config.train_batch_size,
    sampler=train_sampler,
)
valid_data_loader = DataLoader(
    valid_dataset,
    shuffle=False,
    batch_size=self.train_config.valid_batch_size,
    sampler=valid_sampler,
)

The problem is that the DataLoader returns batches in which every element has the same index. For example, when I request a batch of 32 elements, all 32 indices are identical. I double-checked with __getitem__, and when I iterate with next(iter(train_sampler)) the indices do change, but through the loader it’s always the same index, so I’m basically training on the same point every epoch.
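Roughly how I am checking this (illustrative, using the objects defined above):

print(next(iter(train_sampler)))       # the sampler alone yields changing indices
batch = next(iter(train_data_loader))  # but every element of this batch
print(batch)                           # corresponds to one repeated index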

Could you help me to sort this out?

Thanks

I cannot reproduce the issue and get the expected results:

import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler


class MyDataset(Dataset):
    def __init__(self, length=10):
        self.data = torch.arange(length).unsqueeze(1)
        
    def __getitem__(self, index):
        print("calling __getitem__ with index {}".format(index))
        x = self.data[index]
        return x
    
    def __len__(self):
        return len(self.data)


train_dataset = MyDataset()
for x in train_dataset:
    print(x)
# calling __getitem__ with index 0
# tensor([0])
# calling __getitem__ with index 1
# tensor([1])
# calling __getitem__ with index 2
# tensor([2])
# calling __getitem__ with index 3
# tensor([3])
# calling __getitem__ with index 4
# tensor([4])
# calling __getitem__ with index 5
# tensor([5])
# calling __getitem__ with index 6
# tensor([6])
# calling __getitem__ with index 7
# tensor([7])
# calling __getitem__ with index 8
# tensor([8])
# calling __getitem__ with index 9
# tensor([9])
# calling __getitem__ with index 10
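# (index 10 is out of range, so __getitem__ raises an IndexError,
#  which is what stops the plain for-loop over the dataset)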


train_sampler = RandomSampler(train_dataset, num_samples=100)

train_data_loader = DataLoader(train_dataset, shuffle=False, batch_size=2, sampler=train_sampler)
for x in train_data_loader:
    print(x)
# calling __getitem__ with index 5
# calling __getitem__ with index 2
# tensor([[5],
#         [2]])
# calling __getitem__ with index 0
# calling __getitem__ with index 6
# tensor([[0],
#         [6]])
# calling __getitem__ with index 1
# calling __getitem__ with index 8
# tensor([[1],
#         [8]])
# calling __getitem__ with index 4
# calling __getitem__ with index 3
# tensor([[4],
#         [3]])
# calling __getitem__ with index 7
# calling __getitem__ with index 9
# tensor([[7],
#         [9]])
# calling __getitem__ with index 1
# calling __getitem__ with index 4
# tensor([[1],
#         [4]])
# calling __getitem__ with index 9
# calling __getitem__ with index 2
# tensor([[9],
#         [2]])
# calling __getitem__ with index 5
# calling __getitem__ with index 6
# tensor([[5],
#         [6]])
# ...

You can see that the DataLoader shuffles the indices and returns the desired 100 samples.
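As a quick sanity check, iterating the loader defined above confirms the total count:

total = sum(batch.size(0) for batch in train_data_loader)
print(total)                   # 100, i.e. num_samples
print(len(train_data_loader))  # 50 batches of size 2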

If you are calling this code directly as posted above, note that it will recreate a new iterator on each call, so you might want to split the iter and next calls.
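In other words, keep a single iterator around instead of recreating it:

loader_iter = iter(train_data_loader)
first = next(loader_iter)   # first batch
second = next(loader_iter)  # the next, different batch
# calling next(iter(train_data_loader)) repeatedly would rebuild the
# iterator each time and always return a freshly shuffled first batch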

I tried to reproduce your results, but I still get the same issue.

train_sampler = RandomSampler(TrainData, num_samples=10)

I tried to test whether the data is loaded properly, and then tested the loaded batch, but the results differ. Note that if I compute x['data'][1][11] I get the same tensor as index 0.

I don’t know where the issue might be coming from, so feel free to post a minimal, executable code snippet reproducing it. Your previous code snippet did not show the issue as mentioned before.

I realized where the problem was coming from.
It was in __getitem__, which I had originally written as follows:

def __getitem__(self, item):
    ...
    data = dict()
    data['features'] = ...
    data['target'] = ...
    return data

So I was returning a dict, not an (x, y) tuple, which confused the DataLoader. After modifying the output format it worked like a charm. Many thanks for your time and help in this regard.
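For reference, a minimal sketch of the corrected method (the attribute names are illustrative):

def __getitem__(self, item):
    # return an (x, y) tuple instead of a dict
    x = self.features[item]
    y = self.targets[item]
    return x, y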