Training a model when the dataset is too big to fit on the hard drive

torayeff · May 19, 2019, 6:29pm

I have run into the following situation: my dataset is 1 TB, but my local drive only has a capacity of 500 GB. I can train a model on a smaller subset of the dataset, but I would like to make use of the whole dataset. Is there a method or paper that covers training a model in this kind of situation?

ptrblck · May 19, 2019, 9:57pm

Where is the dataset stored currently?
If it’s stored on a network drive, you could load the data lazily from there, which will most likely slow down your overall training due to the latency.
Otherwise, you would somehow have to swap the old data on your local drive for new data (in each epoch), which might also be too slow.
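
As a rough sketch of the lazy-loading idea: the dataset below keeps only file paths in memory and reads a sample when it is indexed. The class name `LazyFileDataset`, the one-file-per-sample layout, and the temporary directory standing in for the network mount are all assumptions for illustration. Samples are returned as raw bytes; a real `__getitem__` would decode them into tensors.

```python
import os
import tempfile


class LazyFileDataset:
    """Map-style dataset that reads each sample from disk only when indexed.

    It only defines __len__ and __getitem__, which is the map-style
    protocol torch.utils.data.DataLoader expects, so torch is not
    imported in this sketch.
    """

    def __init__(self, root):
        # Only the file paths are kept in memory, never the file contents.
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if os.path.isfile(os.path.join(root, name))
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The read happens here, so a network mount pays its latency per sample.
        with open(self.paths[index], "rb") as f:
            return f.read()


# Hypothetical stand-in for the network mount, so the sketch runs anywhere.
demo_root = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(demo_root, f"sample_{i}.bin"), "wb") as f:
        f.write(bytes([i]) * 4)

dataset = LazyFileDataset(demo_root)
```

If you pass something like this to a `DataLoader`, setting `num_workers` above 0 lets several reads run in parallel, which may hide part of the network latency.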

Do you have an external SSD (with USB 3) which might be big enough?
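
The swapping approach mentioned above could look roughly like this, assuming the data is already split into shard files that each fit on the local drive. The function name `iterate_shards`, the shard file names, and the temporary directories standing in for the remote and local locations are hypothetical. Only one shard is kept on the local drive at a time.

```python
import os
import shutil
import tempfile


def iterate_shards(remote_dir, cache_dir):
    """Yield local paths of shards, keeping only one shard on the local drive."""
    os.makedirs(cache_dir, exist_ok=True)
    for name in sorted(os.listdir(remote_dir)):
        local_path = os.path.join(cache_dir, name)
        # Copy the next shard from the (slow) remote location to local disk.
        shutil.copy(os.path.join(remote_dir, name), local_path)
        try:
            yield local_path
        finally:
            # Free the space before the next shard is fetched.
            os.remove(local_path)


# Hypothetical stand-ins for the network drive and the local scratch space.
remote_dir = tempfile.mkdtemp()
cache_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(remote_dir, f"shard_{i}.bin"), "wb") as f:
        f.write(bytes([i]) * 8)

seen = []
for epoch in range(2):
    for shard_path in iterate_shards(remote_dir, cache_dir):
        # A real loop would build a dataset from shard_path and train on it here.
        files_in_cache = len(os.listdir(cache_dir))
        seen.append((epoch, os.path.basename(shard_path), files_in_cache))
```

Note that every epoch copies the whole dataset over the network again, which is why this might be too slow in practice.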

torayeff · May 20, 2019, 5:30am

My data is indeed on a network drive. So it seems the only option now is to purchase an external SSD.

ptrblck · May 20, 2019, 7:56am

Internal drives would be better of course, if that’s possible.