Hello everyone! I have a small val dataset (~386 entries), so with batch_size=256 the dataloader only does two iterations, at least with num_workers <= 1. As I found out, setting num_workers to a larger number yields copies of batches: with num_workers=2 the dataset is yielded twice, with num_workers=4 it is yielded 4 times (totalling 8 batches), and so on. The batches are not empty, so since I stumbled across this on a prediction step, I got a dataset several times bigger than the original one. This took me a while to figure out, it's not indicated in the docs, and such behavior seems unexpected.
That would be interesting behavior indeed, but I cannot reproduce it on my end:
$ cat dataloader.py
import torch


class ToyDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        data = torch.tensor(idx)
        return data

    def __len__(self):
        return 386


dataset = ToyDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=2)

for batch in loader:
    print(batch.size(0))
$ python3 dataloader.py
256
130
Thank you very much for answering! I could not reproduce it either; the problem seems to have fixed itself. And yet I distinctly remember changing the num_workers argument and seeing len(dm.dataloader()) change accordingly. Sorry I couldn't be of more help.
Finally was able to reproduce, sorry for bumping this old thread. Seems the issue is with using an IterableDataset:
import torch


class ToyDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        data = torch.arange(len(self))
        yield from data

    def __len__(self):
        return 386


dataset = ToyDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=2)
print(len(loader), len(list(loader)))  # 2 4
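In case anyone else hits this: with an IterableDataset, every worker process gets its own copy of the dataset object and iterates it from start to finish, so without per-worker sharding the data is yielded once per worker (the IterableDataset docs call this out). Below is a minimal sketch of sharding with torch.utils.data.get_worker_info(); the strided slicing is just one way to split the work, pick whatever fits your data:

import torch


class ToyDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        data = torch.arange(len(self))
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is not None:
            # Running inside a worker process: keep only this worker's slice
            # so the workers yield disjoint parts of the dataset.
            data = data[worker_info.id::worker_info.num_workers]
        yield from data

    def __len__(self):
        return 386


dataset = ToyDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=2)
print(sum(batch.size(0) for batch in loader))  # 386, no duplicates

Note that each worker still assembles its own batches, so with 2 workers you get two batches of 193 here instead of 256 + 130; only the total number of samples is back to normal.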