How to load large datasets that cannot fit into memory

kyy · June 1, 2022, 7:22pm

I have read the tutorial on loading images directly from files (Writing Custom Datasets, DataLoaders and Transforms — PyTorch Tutorials 1.11.0+cu102 documentation), but my use case is not quite the same. In the tutorial, each image can easily be saved as an individual file, whereas my dataset is a 4D tensor of shape 1000000 * 100 * 10 * 50, and it is not feasible to save every single data point to its own file. It is possible to chunk the data into several (e.g. 10) files, but then it is difficult to do “full shuffling”, that is, randomly selecting data points from the entire dataset rather than from just a few chunks. Are there any examples that deal with this use case?

Thank you!

JuanFMontesinos · June 1, 2022, 11:53pm

Just use NumPy memory-mapped arrays (np.memmap).
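
For anyone landing here later, a minimal sketch of what that could look like. It assumes the tensor has been written once to a single raw binary file; the class name `MemmapDataset`, the file name, and the tiny stand-in shape are all made up for illustration (at float32 the real 1000000 * 100 * 10 * 50 tensor would be roughly 200 GB on disk). The class only implements `__len__` and `__getitem__`, which is the map-style dataset protocol, so full shuffling reduces to drawing random indices.

```python
import os
import tempfile

import numpy as np


class MemmapDataset:
    """Map-style dataset backed by a memory-mapped file (hypothetical helper)."""

    def __init__(self, path, shape, dtype=np.float32):
        self.path, self.shape, self.dtype = path, shape, dtype
        self._data = None  # opened lazily, so each worker process maps the file itself

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, idx):
        if self._data is None:
            self._data = np.memmap(self.path, dtype=self.dtype, mode="r", shape=self.shape)
        # Copy only the requested sample from disk into an ordinary in-memory array.
        return np.array(self._data[idx])


# Tiny stand-in for the real tensor so the example runs quickly.
shape = (1000, 4, 3, 2)
path = os.path.join(tempfile.mkdtemp(), "data.dat")

# Write the file chunk by chunk; the full array is never held in memory.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
for start in range(0, shape[0], 100):
    writer[start:start + 100] = start  # placeholder values, real data would go here
writer.flush()
del writer

dataset = MemmapDataset(path, shape)
order = np.random.default_rng(0).permutation(len(dataset))  # full shuffle over all indices
sample = dataset[int(order[0])]
print(sample.shape)
```

An object like this can be handed to `torch.utils.data.DataLoader(dataset, batch_size=..., shuffle=True)`, which samples indices over the whole dataset. Random reads are much cheaper on an SSD than on a spinning disk, so actual throughput will depend on the storage.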

ejguan · June 6, 2022, 8:59pm

You might want to explore TorchData, using an iterable-style DataPipe (IterDataPipe) to load the data lazily. We have a built-in shuffle operation, for which you can specify the shuffle buffer size in the streaming case.
We are going to release 0.4.0 with PyTorch 1.12 in the coming weeks. With this release, DataPipe becomes fully backward compatible (BC) with DataLoader in terms of shuffle determinism and automatic sharding.
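
To make the shuffle-buffer idea concrete, here is a plain-Python sketch of the general technique. It is not TorchData's implementation, and `buffered_shuffle` is a made-up name; in TorchData itself the operation is, as far as I know, exposed as `.shuffle(buffer_size=...)` on an IterDataPipe. Note that this only approximates a full shuffle: an item can move at most about `buffer_size` positions earlier than where it appears in the stream.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items from a stream in approximately random order (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
        else:
            # Emit a random buffered item and put the new one in its slot.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # The stream is exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

With chunked files, a larger buffer (or shuffling the chunk order as well) gets closer to a full shuffle, at the cost of more memory.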