Training on a subset of data for a few epochs and then proceeding to the next subset for a few epochs, and so on?


(Devi Prasad Khatua) #1

My training dataset is huge, and it's impossible to load all of it into main memory at once. So I'm loading a few blocks (subsets) of the data, training until convergence, then proceeding to the next subset and training until convergence, and so on. Is this the right approach?

The model's performance stays roughly the same even when training on a new subset of data.

Is this method fundamentally wrong? If so, why?

I know this question is not specific to PyTorch, sorry about that, but I find this forum very active.

Thanks!


#2

I guess it depends on the size of these subsets and how well each subset reflects the data distribution of the complete dataset. If some subsets have a high class imbalance, you will fit your model to their majority classes, which will most likely hurt your overall performance.
If possible, I would rather load the data lazily (using Dataset and DataLoader) and train on the complete dataset, as sketched below.
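Here is a minimal sketch of what lazy loading could look like, assuming each sample is stored as its own .npy file under data/train and the labels fit in memory (the paths, file layout, and hyperparameters are made up for illustration; adapt them to your actual storage format):

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class LazyDataset(Dataset):
    """Loads one sample from disk per __getitem__ call, so the full
    dataset never has to reside in main memory at once."""

    def __init__(self, data_dir, labels):
        self.data_dir = data_dir
        self.labels = labels                      # e.g. a 1-D array of targets
        self.files = sorted(os.listdir(data_dir)) # one .npy file per sample

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Only this single sample is read from disk when the DataLoader asks for it.
        x = np.load(os.path.join(self.data_dir, self.files[idx]))
        y = self.labels[idx]
        return torch.from_numpy(x).float(), torch.tensor(y)

# Hypothetical paths and settings for illustration only.
dataset = LazyDataset("data/train", labels=np.load("data/train_labels.npy"))
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

num_epochs = 10
for epoch in range(num_epochs):
    for x, y in loader:
        ...  # forward pass, loss, backward pass, optimizer step
```

This way every epoch shuffles over the whole dataset, so the model sees the complete data distribution instead of converging on one subset at a time, while the workers stream batches from disk in the background.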