How to implement streaming data using the `Dataset` interface class?

According to the spec of torch.utils.data.Dataset:

An abstract class representing a Dataset.

All other datasets should subclass it. All subclasses should override
__len__, that provides the size of the dataset, and __getitem__,
supporting integer indexing in range from 0 to len(self) exclusive.

My problem is this: what if the data arrives as an online stream, so I cannot know __len__ at all? Or what if I just have a very large dataset that I intend to iterate over only once, so I don't care about its __len__?
In both cases, can I safely skip the __len__ function when subclassing Dataset?


You should still override __len__, even for a streaming dataset, because other classes call it (for example, the samplers that DataLoader uses to generate indices). You can have it return the number of examples you want per “epoch”, or simply a very large integer.
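A minimal sketch of that suggestion, under some assumptions: the class name `StreamDataset` and the `samples_per_epoch` parameter are made up for illustration, and `itertools.count()` stands in for a real stream. The index passed to __getitem__ is ignored, and the next item is simply pulled from the stream.

```python
import itertools

from torch.utils.data import DataLoader, Dataset


class StreamDataset(Dataset):
    """Hypothetical map-style wrapper around a stream of unknown length."""

    def __init__(self, stream, samples_per_epoch):
        self.stream = iter(stream)
        # Assumption: we choose how many examples make up one "epoch".
        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        # Not the true size of the stream, only the per-"epoch" count.
        return self.samples_per_epoch

    def __getitem__(self, index):
        # The index is ignored; each call consumes the next stream item.
        return next(self.stream)


# Stand-in for an endless data source: 0, 1, 2, ...
stream = itertools.count()
dataset = StreamDataset(stream, samples_per_epoch=8)

# num_workers=0 keeps a single copy of the stream in the main process.
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)
batches = [batch.tolist() for batch in loader]
print(batches)
```

Iterating over the loader again would continue from where the stream left off rather than restart. Note that with num_workers greater than 0, each worker process would hold its own copy of the dataset and therefore of the stream, so this sketch would likely yield duplicated items unless the stream is split per worker.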
