Custom dataset always delivers idx 0

I’m experiencing some very strange behavior that I wouldn’t expect.

I created this minimal example here to show the problem.
First I created a Dataset:

from torch.utils.data import Dataset
class SquadDataset(Dataset):

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        print(idx)
        return self.encodings[idx]

and then ran the following minimal script:

lst = [1, 3, 4, 5, 6, 7, 8, 9, 10]

d = SquadDataset(encodings = lst)

for _ in range(10):

    e = next(iter(d))

I get only a long list of zeros.

Shouldn’t idx be a random number?
Shouldn’t __call__ be invoked before __getitem__ generates the number? Using a debugger I realized that this is not the case (it calls __getitem__ directly, not __call__).

You are recreating the iterator instead of calling next on it. This will work:

d = SquadDataset(encodings = lst)
iter_d = iter(d)
for _ in range(10):
    e = next(iter_d)

Output:

0
1
2
3
4
5
6
7
8
9
Traceback (most recent call last):

  Cell In[78], line 2
    e = next(iter_d)

StopIteration
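
As a side note on your __call__ question: iter(d) on a dataset that only defines __getitem__ falls back to Python’s sequence protocol, so __getitem__ is called directly with 0, 1, 2, … until an IndexError is raised; __call__ is never involved. A rough, minimal sketch of what that fallback iterator does (assuming the SquadDataset and lst from your post):

def manual_iter(dataset):
    # mimic the sequence-protocol iterator: call __getitem__ with 0, 1, 2, ...
    idx = 0
    while True:
        try:
            yield dataset[idx]  # calls dataset.__getitem__(idx)
        except IndexError:
            return  # translated into StopIteration
        idx += 1

d = SquadDataset(encodings = lst)
print(list(manual_iter(d)))  # prints idx 0..9, then [1, 3, 4, 5, 6, 7, 8, 9, 10]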

Thanks,

but the problem was that I didn’t wrap the Dataset in the DataLoader class.

This worked and solved the problem:

data = DataLoader(SquadDataset(encodings = lst), batch_size = 8)

...
iter_d = iter(data)
for _ in range(10):
    e = next(iter_d)
....

No, your approach won’t solve the actual issue of recreating the iterator, as shown in my example, and you are still returning the same samples defined by the batch_size, i.e. you are missing the last samples.
If you reduce the batch_size to 1, you see the same issue:

data = DataLoader(SquadDataset(encodings = lst), batch_size = 1)
for _ in range(10):
    e = next(iter(data))
# 0
# 0
# 0
# 0
# 0
# 0
# 0
# 0
# 0
# 0  
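
If you want to use a DataLoader here, the same rule applies: create the iterator once, or simply loop over the loader. A minimal sketch (assuming the SquadDataset and lst from above):

from torch.utils.data import DataLoader

data = DataLoader(SquadDataset(encodings = lst), batch_size = 1)
for batch in data:  # a single iterator is created under the hood
    print(batch)
# __getitem__ receives idx 0..8 and every sample is returned exactly once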

I tried your solution.

It works.
But even my proposed solution works…

I’m going to recreate the file from scratch and then report back to you.

Your solution of using a DataLoader would only work if batch_size >= num_samples, i.e. if you are returning all samples in a single next(iter(loader)) call. In your example you never receive the last list entry (10), and you can use my code snippet to see that recreating the iterator inside the loop is causing the issue.
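
For illustration, a quick sketch of that point (assuming the SquadDataset and lst from above):

from torch.utils.data import DataLoader

loader = DataLoader(SquadDataset(encodings = lst), batch_size = 8)
print(next(iter(loader)))  # first 8 samples only; the last entry (10) is never returned

loader = DataLoader(SquadDataset(encodings = lst), batch_size = len(lst))
print(next(iter(loader)))  # all 9 samples, but only because batch_size >= num_samples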

I’m back here because I tried your way, but it does not deliver what is expected.
I mean:

  1. The __len__ method is never called, which is not the case with the code I posted above.
  2. idx increases incrementally from 0 to the last element of the range, but idx should be a random number.

Can you point me in the right direction?

  1. The __len__ method is used to initialize the sampler in a DataLoader (see the sketch after this list). I’m currently unsure if you are trolling or just not interested in understanding why recreating the iterator inside a nested loop will not sample from the entire dataset. I’ve already posted code snippets which are executable and which you can copy/paste to run. With that being said, feel free to stick to your approach. Other users, please don’t use it, as this approach will miss samples, and make sure to create the iterator outside the sampling loop.

  2. Use shuffle=True in a DataLoader.
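
A minimal sketch showing both points (assuming the SquadDataset and lst from above): the DataLoader’s sampler queries len(dataset) when it is created, and shuffle=True turns the idx values passed to __getitem__ into a random permutation:

from torch.utils.data import DataLoader

loader = DataLoader(SquadDataset(encodings = lst), batch_size = 1, shuffle = True)
for batch in loader:
    pass
# __getitem__ now prints 0..8 in random order; __len__ was used to set up the RandomSampler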