TensorDataset with lazy loading?

I have hundreds of gigs of data in the form of hundreds of thousands of files to use for training. I cannot load it with torch.utils.data.TensorDataset() because I don’t have enough RAM to hold it all at once. How do I “lazily load” the data into a TensorDataset, so that it reads the training files on an as-needed basis?

Use numpy memory map.
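
For example, here is a minimal sketch assuming the data has been packed into a single binary file of float32 samples (the file name, sample count, and shape below are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    """Reads samples from one large binary file without loading it into RAM."""
    def __init__(self, path, num_samples, sample_shape):
        # np.memmap maps the file into virtual memory; pages are only read on access
        self.data = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_samples,) + tuple(sample_shape))

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Copy the slice so the returned tensor owns its own memory
        return torch.from_numpy(np.array(self.data[idx]))

# Hypothetical usage: 100,000 samples of shape (3, 64, 64)
# dataset = MemmapDataset("train.bin", 100_000, (3, 64, 64))
```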

That would work if I had one large file, but I have hundreds of thousands of small binary files that add up to 100s of gigabytes.

Yes, but you can construct that huge file yourself, or split the data into a few big files for convenience. Alternatively, the HDF5 format also allows lazy loading and can carry metadata.

Anyway, it seems your use case would be better solved with a custom dataset that loads each file on the fly. Have a look at:
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
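
Something along these lines, assuming each sample lives in its own .npy file (the directory layout and loading logic are placeholders for your actual format):

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class FileDataset(Dataset):
    """Loads one small file per __getitem__ call instead of holding everything in RAM."""
    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/*.npy"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = np.load(self.paths[idx])  # read a single file on demand
        return torch.from_numpy(sample)

# DataLoader workers overlap the per-file I/O with training
# loader = DataLoader(FileDataset("data/train"), batch_size=64,
#                     shuffle=True, num_workers=4)
```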

So the short answer is:
If you have one huge dataset/array, use HDF5 or a memory map.
If you have hundreds of thousands of small files, use a custom dataset.


So basically overload __getitem__.
I tried that, but I/O becomes a big bottleneck, unless I do some fancy caching or something.
Perhaps HDF5 is the best way to go.
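
If I do, a rough sketch of an HDF5-backed dataset with h5py might look like the following, assuming the small files have first been packed into one file containing a dataset named "samples" (the file and dataset names are made up):

```python
import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Slices an HDF5 dataset lazily; only the requested rows hit the disk."""
    def __init__(self, path):
        self.path = path
        self.file = None
        with h5py.File(path, "r") as f:
            self.length = len(f["samples"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open the file lazily inside each worker process to avoid
        # sharing a single file handle across forked DataLoader workers
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        return torch.from_numpy(self.file["samples"][idx])
```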