TensorDataset with lazy loading?

I have hundreds of gigs of data in the form of hundreds of thousands of files to use for training. I cannot load it with torch.utils.data.TensorDataset() because I don’t have enough RAM to hold it all at once. How do I “lazily load” the data into a TensorDataset, so that it reads the training files on an as-needed basis?

Use numpy memory map.
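
For example, here is a minimal sketch assuming the data has been packed into a single binary file of float32 samples (the file name, sample count, and shape below are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    """Reads samples from one large binary file without loading it into RAM."""
    def __init__(self, path, num_samples, sample_shape):
        # np.memmap maps the file into virtual memory; pages are only read on access
        self.data = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_samples,) + tuple(sample_shape))

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Copy the slice so the returned tensor owns its own memory
        return torch.from_numpy(np.array(self.data[idx]))

# Hypothetical usage: 100,000 samples of shape (3, 64, 64)
# dataset = MemmapDataset("train.bin", 100_000, (3, 64, 64))
```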

That would work if I had one large file, but I have hundreds of thousands of small binary files that add up to 100s of gigabytes.

Yes, but you can construct that huge file yourself, or split the data into a few big files for convenience. Alternatively, the HDF5 format also allows lazy loading and can carry metadata.

Anyway, it seems your use case would be better solved with a custom dataset that loads each file on the fly. Have a look at:
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
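
Something along these lines, assuming each sample lives in its own .npy file (the directory layout and loading logic are placeholders for your actual format):

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class FileDataset(Dataset):
    """Loads one small file per __getitem__ call instead of holding everything in RAM."""
    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/*.npy"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = np.load(self.paths[idx])  # read a single file on demand
        return torch.from_numpy(sample)

# DataLoader workers overlap the per-file I/O with training
# loader = DataLoader(FileDataset("data/train"), batch_size=64,
#                     shuffle=True, num_workers=4)
```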

So the short answer is:
If you have one huge dataset/array, use HDF5 or a memory map.
If you have hundreds of thousands of small files, use a custom dataset.


So basically overload __getitem__.
I tried that, but I/O becomes a big bottleneck, unless I do some fancy caching or something.
Perhaps HDF5 is the best way to go.
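
If I do, a rough sketch of an HDF5-backed dataset with h5py might look like the following, assuming the small files have first been packed into one file containing a dataset named "samples" (the file and dataset names are made up):

```python
import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Slices an HDF5 dataset lazily; only the requested rows hit the disk."""
    def __init__(self, path):
        self.path = path
        self.file = None
        with h5py.File(path, "r") as f:
            self.length = len(f["samples"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open the file lazily inside each worker process to avoid
        # sharing a single file handle across forked DataLoader workers
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        return torch.from_numpy(self.file["samples"][idx])
```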