I have run into this situation: my dataset is 1 TB, but my local drive only holds 500 GB. I can train a model on a smaller subset of the original dataset, but I would like to make use of the whole dataset. Is there a method or paper that covers training a model in this kind of situation?
Where is the dataset stored currently?
If it’s stored on a network drive, you could load the data lazily from there, which will most likely slow down your overall training due to the latency.
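As a rough illustration of lazy loading, here is a minimal sketch of a map-style dataset that keeps only file paths in memory and reads each sample when it is indexed. The class name `LazyFileDataset`, the `*.bin` pattern, and the one-sample-per-file layout are assumptions made for the example, not something from your setup.

```python
from pathlib import Path


class LazyFileDataset:
    """Map-style dataset: stores file paths only, reads a sample on access."""

    def __init__(self, root, pattern="*.bin"):
        # Only the list of paths is held in memory; no file contents are read here.
        # `root` would be the mount point of the network drive (assumption).
        self.paths = sorted(Path(root).glob(pattern))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The actual read happens here, so the network latency is paid per sample.
        # Decoding the raw bytes into a tensor would also go here.
        return self.paths[index].read_bytes()
```

Since the class implements `__len__` and `__getitem__`, it should be usable with `torch.utils.data.DataLoader`, and setting `num_workers` above zero there can hide part of the read latency behind the training step.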
Otherwise, you would somehow have to swap the old data on the local drive for new data (in each epoch), which might also be too slow.
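A minimal sketch of that swapping idea, assuming the dataset is already split into shard files that each fit on the local drive. The function name `iter_local_shards` and its parameters are made up for the example; the copy step is exactly the part that might be too slow.

```python
import shutil
from pathlib import Path


def iter_local_shards(remote_dir, cache_dir, shards_per_round):
    """Yield groups of shard paths copied to the local cache, one group at a time."""
    remote_shards = sorted(p for p in Path(remote_dir).iterdir() if p.is_file())
    cache = Path(cache_dir)
    for start in range(0, len(remote_shards), shards_per_round):
        # Drop the previous group so the cache never holds more than one group.
        shutil.rmtree(cache, ignore_errors=True)
        cache.mkdir(parents=True, exist_ok=True)
        group = remote_shards[start:start + shards_per_round]
        # Copy the next group from the remote location onto the local drive.
        yield [Path(shutil.copy2(src, cache / src.name)) for src in group]
```

In each epoch you would loop over the yielded groups and train on one group before the next one replaces it. Note that samples are then only shuffled within a group, not across the whole dataset.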
Do you have an external SSD (with USB 3) which might be big enough?
My data is indeed on a network drive. So it seems the only option now is to buy an external SSD.
Internal drives would be better of course, if that’s possible.