I am using a dataset that is quite large (around 5 TB).
During testing I used a smaller subset (around 300 GB), and the data-loading time was as short as 0.1 seconds. However, when I switch to the full dataset, the loading time becomes variable, ranging from 0.1 to around 1.0 seconds. Here are two approaches that I have tried:
- Unify each input into a 2D tensor, concatenate them all into one large binary file, and read directly from a random offset in that file.
- Pickle the data into an LMDB database and fetch samples by a random key. (Rough sketches of both approaches are included after this list.)
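
In case it helps, this is roughly what my two loaders look like. These are simplified sketches, not my exact code; names like `BinaryFileDataset`, `num_samples`, and `sample_shape` are just placeholders for my actual setup:

```python
import pickle
import lmdb
import numpy as np
import torch
from torch.utils.data import Dataset

class BinaryFileDataset(Dataset):
    """Approach 1: fixed-size 2D samples stored back-to-back in one big binary file."""
    def __init__(self, path, num_samples, sample_shape, dtype=np.float32):
        self.path = path
        self.num_samples = num_samples
        self.sample_shape = sample_shape
        self.dtype = dtype
        self.data = None  # opened lazily so every DataLoader worker gets its own mmap

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if self.data is None:
            self.data = np.memmap(self.path, dtype=self.dtype, mode="r",
                                  shape=(self.num_samples, *self.sample_shape))
        # .copy() forces the actual disk read to happen here, in the worker
        return torch.from_numpy(self.data[idx].copy())

class LMDBDataset(Dataset):
    """Approach 2: pickled samples stored in LMDB, fetched by a stringified index key."""
    def __init__(self, lmdb_path, num_samples):
        self.lmdb_path = lmdb_path
        self.num_samples = num_samples
        self.env = None  # opened lazily; LMDB handles should not be shared across forked workers

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if self.env is None:
            self.env = lmdb.open(self.lmdb_path, readonly=True, lock=False,
                                 readahead=False)
        with self.env.begin(buffers=True) as txn:
            buf = txn.get(str(idx).encode())
        return pickle.loads(buf)
```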
Both approaches work well with smaller datasets but get worse with the full dataset. Any ideas why this is happening and how to solve it?
P.S. I am using DDP for training, so one idea I had is to shard the whole dataset into 8 pieces, and since I have 8 GPUs, each card could read from its own shard/database (rough sketch below). Suggestions about this idea would be welcome too!
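
For reference, this is roughly how I imagine the per-rank sharding would look. This is only a sketch: `shard_{rank}.bin` is a made-up naming scheme, and `BinaryFileDataset` is the class from the sketch above. The alternative I know of is to keep a single dataset and let `DistributedSampler` hand each rank disjoint indices:

```python
import os
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# Option A: keep one shared dataset; DistributedSampler gives each rank
# a disjoint subset of indices, so no extra files are needed.
def build_loader(dataset, batch_size):
    sampler = DistributedSampler(dataset, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      num_workers=4, pin_memory=True)

# Option B: pre-split the data into one file per rank (hypothetical naming
# shard_{rank}.bin) so each GPU only ever touches its own, smaller file.
def build_sharded_dataset(shard_dir, samples_per_shard, sample_shape):
    rank = dist.get_rank()  # assumes init_process_group() was already called
    shard_path = os.path.join(shard_dir, f"shard_{rank}.bin")
    return BinaryFileDataset(shard_path, samples_per_shard, sample_shape)
```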