Hello everyone! I have a small val dataset (~386 entries), so with batch_size=256 the dataloader only does two iterations, at least with num_workers <= 1. As I found out, setting num_workers to a larger number yields copies of batches: with num_workers=2 the dataset is yielded twice, with num_workers=4 it is yielded 4 times (totalling 8 batches), and so on. The batches are not empty, so since I stumbled across this on a prediction step, I got a dataset several times bigger than the original one. This took me a while to figure out, it's not indicated in the docs, and such behavior seems unexpected.
That would be interesting behavior indeed, but I cannot reproduce it on my end:
$ cat dataloader.py
import torch


class ToyDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        data = torch.tensor(idx)
        return data

    def __len__(self):
        return 386


dataset = ToyDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=2)

for batch in loader:
    print(batch.size(0))
$ python3 dataloader.py
256
130
Thank you very much for answering! I could not reproduce it either; the problem seems to have fixed itself. And yet I distinctly remember changing the num_workers argument and seeing len(dm.dataloader()) change accordingly. Sorry I couldn't be of more help.
Finally was able to reproduce, sorry for bumping this old thread. Seems the issue is with using an IterableDataset:
import torch


class ToyDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        data = torch.arange(len(self))
        yield from data

    def __len__(self):
        return 386


dataset = ToyDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=2)
print(len(loader), len(list(loader)))  # 2 4
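In case anyone else hits this: with an IterableDataset, every worker process gets its own copy of the dataset object and iterates it from start to finish, so without per-worker sharding the data is yielded once per worker (the IterableDataset docs call this out). Below is a minimal sketch of sharding with torch.utils.data.get_worker_info(); the strided slicing is just one way to split the work, pick whatever fits your data:

import torch


class ToyDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        data = torch.arange(len(self))
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is not None:
            # Running inside a worker process: keep only this worker's slice
            # so the workers yield disjoint parts of the dataset.
            data = data[worker_info.id::worker_info.num_workers]
        yield from data

    def __len__(self):
        return 386


dataset = ToyDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=2)
print(sum(batch.size(0) for batch in loader))  # 386, no duplicates

Note that each worker still assembles its own batches, so with 2 workers you get two batches of 193 here instead of 256 + 130; only the total number of samples is back to normal.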