DataLoader returning the same index for every sample in a batch

I’m building a dataloader, which I wrote as follows:

train_sampler = RandomSampler(train_dataset, num_samples=100)
valid_sampler = RandomSampler(valid_dataset, num_samples=50)

train_data_loader = DataLoader(
    train_dataset,
    shuffle=False,
    batch_size=self.train_config.train_batch_size,
    sampler=train_sampler,
)
valid_data_loader = DataLoader(
    valid_dataset,
    shuffle=False,
    batch_size=self.train_config.valid_batch_size,
    sampler=valid_sampler,
)

The problem is that the DataLoader returns batches in which every element has the same index. For example, when I request a batch of 32 elements, all 32 indices are identical. I double-checked with __getitem__, and when I iterate with next(iter(train_sampler)) the indices do change, but through the loader it’s always the same index, so I’m basically training on the same point every epoch.
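Roughly how I am checking this (illustrative, using the objects defined above):

print(next(iter(train_sampler)))       # the sampler alone yields changing indices
batch = next(iter(train_data_loader))  # but every element of this batch
print(batch)                           # corresponds to one repeated index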

Could you help me to sort this out?

Thanks

I cannot reproduce the issue and get the expected results:

import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler


class MyDataset(Dataset):
    def __init__(self, length=10):
        self.data = torch.arange(length).unsqueeze(1)
        
    def __getitem__(self, index):
        print("calling __getitem__ with index {}".format(index))
        x = self.data[index]
        return x
    
    def __len__(self):
        return len(self.data)


train_dataset = MyDataset()
for x in train_dataset:
    print(x)
# calling __getitem__ with index 0
# tensor([0])
# calling __getitem__ with index 1
# tensor([1])
# calling __getitem__ with index 2
# tensor([2])
# calling __getitem__ with index 3
# tensor([3])
# calling __getitem__ with index 4
# tensor([4])
# calling __getitem__ with index 5
# tensor([5])
# calling __getitem__ with index 6
# tensor([6])
# calling __getitem__ with index 7
# tensor([7])
# calling __getitem__ with index 8
# tensor([8])
# calling __getitem__ with index 9
# tensor([9])
# calling __getitem__ with index 10
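# (index 10 is out of range, so __getitem__ raises an IndexError,
#  which is what stops the plain for-loop over the dataset)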


train_sampler = RandomSampler(train_dataset, num_samples=100)

train_data_loader = DataLoader(train_dataset, shuffle=False, batch_size=2, sampler=train_sampler)
for x in train_data_loader:
    print(x)
# calling __getitem__ with index 5
# calling __getitem__ with index 2
# tensor([[5],
#         [2]])
# calling __getitem__ with index 0
# calling __getitem__ with index 6
# tensor([[0],
#         [6]])
# calling __getitem__ with index 1
# calling __getitem__ with index 8
# tensor([[1],
#         [8]])
# calling __getitem__ with index 4
# calling __getitem__ with index 3
# tensor([[4],
#         [3]])
# calling __getitem__ with index 7
# calling __getitem__ with index 9
# tensor([[7],
#         [9]])
# calling __getitem__ with index 1
# calling __getitem__ with index 4
# tensor([[1],
#         [4]])
# calling __getitem__ with index 9
# calling __getitem__ with index 2
# tensor([[9],
#         [2]])
# calling __getitem__ with index 5
# calling __getitem__ with index 6
# tensor([[5],
#         [6]])
# ...

You can see that the DataLoader shuffles the indices and returns the desired 100 samples.
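As a quick sanity check, iterating the loader defined above confirms the total count:

total = sum(batch.size(0) for batch in train_data_loader)
print(total)                   # 100, i.e. num_samples
print(len(train_data_loader))  # 50 batches of size 2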

If you are calling this code directly as posted above, note that it will recreate a new iterator on each call, so you might want to split the iter and next calls.
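In other words, keep a single iterator around instead of recreating it:

loader_iter = iter(train_data_loader)
first = next(loader_iter)   # first batch
second = next(loader_iter)  # the next, different batch
# calling next(iter(train_data_loader)) repeatedly would rebuild the
# iterator each time and always return a freshly shuffled first batch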

I tried to reproduce your results, but I still get the same issue.

train_sampler = RandomSampler(TrainData, num_samples=10)

I tried to test whether the data is loaded properly, and then tested the loaded batch, but the results differ. Note that if I compute x['data'][1][11] I get the same tensor as index 0.

I don’t know where the issue might be coming from, so feel free to post a minimal, executable code snippet reproducing it. Your previous code snippet did not show the issue as mentioned before.

I realized where the problem was coming from.
It was in __getitem__, which I had originally written as follows:

def __getitem__(self, item):
    ...
    data = dict()
    data['features'] = ...
    data['target'] = ...
    return data

So I was returning a dict, not an (x, y) tuple, which confused the DataLoader. After modifying the output format it worked like a charm. Many thanks for your time and help in this regard.
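For reference, a minimal sketch of the corrected method (the attribute names are illustrative):

def __getitem__(self, item):
    # return an (x, y) tuple instead of a dict
    x = self.features[item]
    y = self.targets[item]
    return x, y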