_DataLoaderIter vs _BaseDataLoaderIter

Hi all,

I am confused about the iterator class of DataLoader. In particular, I wanted to ask whether the implementation has fundamentally changed between some PyTorch versions.

In the online documentation I can only find the class _BaseDataLoaderIter(object) and its subclasses _SingleProcessDataLoaderIter(_BaseDataLoaderIter) and _MultiProcessingDataLoaderIter(_BaseDataLoaderIter). However, when I look at the code installed on my PC in anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py, these three classes do not exist; instead there is only a single class called _DataLoaderIter(object) (which seems somewhat similar to the implementation of _MultiProcessingDataLoaderIter, but they are not exactly the same).

Why can't I find the code of _DataLoaderIter(object) in the documentation? Does it have to do with different PyTorch versions? If so, what consequences does that have if I use custom dataset, sampler, and collate_fn functions? Will they work in either PyTorch version?



All the classes that start with an underscore like _Foo are internal: they are not documented and can change between versions without notice.
The latest big change there I can think of is: https://github.com/pytorch/pytorch/pull/19228
Which version of pytorch do you currently have installed?


Thank you for your reply!
I am currently using version 1.0.1.post2.
Your link indeed explains that _DataLoaderIter was split up into the two classes I mentioned above. Does this mean that if I now implement a custom collate_fn, sampler, and dataset, they might not work on a newer PyTorch version anymore?

It should.
Your code should only touch the public Dataset and Sampler classes, not the _* ones anyway, right?
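To illustrate what "only touching the public classes" looks like, here is a minimal sketch of a custom map-style dataset (the SquaresDataset name and its contents are made up for this example); it relies solely on the documented Dataset and DataLoader interfaces, so it is unaffected by internal refactors like the iterator split:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Toy map-style dataset using only the public Dataset API."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # __getitem__ receives a single index from the sampler
        return torch.tensor(idx ** 2)

loader = DataLoader(SquaresDataset(8), batch_size=4, shuffle=False)
batches = [b.tolist() for b in loader]
print(batches)  # [[0, 1, 4, 9], [16, 25, 36, 49]]
```

Because nothing here touches a `_*` class, the same code runs on versions before and after the iterator refactor.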

Well, I am trying to make the __getitem__ method accept two indices as parameters so that I can select data from a 3D tensor. To do so, I at least have to rewrite the collate_fn, too. Maybe even more, but I haven't got that far yet.

Can you linearize your two indices, treating the data as one larger 1D collection? That way you can use the base loader.
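As a concrete sketch of that linearization idea (GridDataset and its shapes are hypothetical here, not from the thread): map a single flat index back to a (row, col) pair with divmod, so the default sampler and collate_fn work unchanged:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GridDataset(Dataset):
    """Treats a 3D tensor of shape (rows, cols, features) as a flat
    dataset of rows*cols feature vectors, so a single integer index
    is enough for __getitem__."""
    def __init__(self, data):
        self.data = data          # shape (rows, cols, features)
        self.cols = data.shape[1]

    def __len__(self):
        return self.data.shape[0] * self.data.shape[1]

    def __getitem__(self, flat_idx):
        # Recover the 2D index pair from the flat index
        row, col = divmod(flat_idx, self.cols)
        return self.data[row, col]

data = torch.arange(24.0).reshape(2, 3, 4)   # 2 rows, 3 cols, 4 features
loader = DataLoader(GridDataset(data), batch_size=3)
first = next(iter(loader))
print(first.shape)  # torch.Size([3, 4])
```

With this approach no custom sampler or collate_fn is needed at all.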

Only implementation details changed, so a custom dataset, sampler, and collate_fn that use public APIs will work as-is.
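For completeness, a custom collate_fn also only plugs into the public DataLoader argument; here is a small sketch (the pad_collate name and the padding behavior are invented for illustration) that pads variable-length 1D tensors into one batch:

```python
import torch
from torch.utils.data import DataLoader

def pad_collate(batch):
    """Zero-pads a list of 1D tensors to the longest length in the batch."""
    max_len = max(t.shape[0] for t in batch)
    out = torch.zeros(len(batch), max_len)
    for i, t in enumerate(batch):
        out[i, : t.shape[0]] = t
    return out

# A plain list works as a map-style dataset (it has __len__/__getitem__)
samples = [torch.ones(2), torch.ones(3), torch.ones(1)]
loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([3, 3])
```

Since the function is handed to DataLoader rather than patched into an internal class, it keeps working across the iterator refactor.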


Thanks a lot for your replies both of you.
I have just realized that in both implementations, the __getitem__ method is always assumed to take only one argument. In the _DataLoaderIter class there is line 615:
batch = self.collate_fn([self.dataset[i] for i in indices])
and in the other case, when _MapDatasetFetcher is used, there is the line:
data = [self.dataset[idx] for idx in possibly_batched_index]
Thus, both implementations require that __getitem__ take only one argument, and since both of the above are internal, I guess I should not be changing them.
So is there really no way for me to adjust __getitem__ to accept two indices and hence make use of my 3D dataset?
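One possible workaround, sketched below under the assumption that the loader passes sampler output to __getitem__ unmodified (which both code lines quoted above suggest): __getitem__ does take a single argument, but that argument can be a (row, col) tuple yielded by a custom sampler. The PairDataset and PairSampler names are invented for this example:

```python
import torch
from torch.utils.data import Dataset, DataLoader, Sampler

class PairDataset(Dataset):
    """__getitem__ takes one index object, but that object can be a
    (row, col) pair -- the DataLoader never inspects it."""
    def __init__(self, data):
        self.data = data  # 3D tensor of shape (rows, cols, features)

    def __len__(self):
        return self.data.shape[0] * self.data.shape[1]

    def __getitem__(self, idx):
        row, col = idx            # unpack the pair
        return self.data[row, col]

class PairSampler(Sampler):
    """Yields (row, col) tuples instead of flat integer indices."""
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols

    def __len__(self):
        return self.rows * self.cols

    def __iter__(self):
        for r in range(self.rows):
            for c in range(self.cols):
                yield (r, c)

data = torch.arange(24.0).reshape(2, 3, 4)
loader = DataLoader(PairDataset(data), batch_size=3,
                    sampler=PairSampler(2, 3))
first = next(iter(loader))
print(first.shape)  # torch.Size([3, 4])
```

This stays entirely within the public Dataset/Sampler/DataLoader interfaces, so no internal class needs to be modified.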