Training on a subset of data for a few epochs, then proceeding to the next subset for a few epochs, and so on?

My training data is very large and it's impossible to load all of it at once, even into main memory. So I'm loading a few blocks (a subset) of data, training until convergence, then proceeding to the next subset and training until convergence, and so on. Is this the right approach?

The model performance stays roughly the same even when training on a new subset of data.

Is this method fundamentally wrong? If so, why?

I know this question is not specific to PyTorch, sorry, but I find this forum very active.

Thanks!

I guess it depends on the size of each subset and how well it reflects the data distribution of the complete dataset. If some subsets contain a high class imbalance, you will fit your model to their majority classes, which will most likely hurt your overall performance.
If it's possible, I would rather lazily load the data (using Dataset and DataLoader) and train on the complete dataset, as sketched below.
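Here is a minimal sketch of what lazy loading could look like, assuming each sample is stored as its own `.pt` file on disk; the file layout, the paths, and the `LazyFileDataset` name are just illustrative, not a fixed API:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class LazyFileDataset(Dataset):
    """Loads one sample per __getitem__ call, so only the current batch lives in memory."""

    def __init__(self, sample_paths):
        # sample_paths: list of file paths, each containing a (data, target) pair
        self.sample_paths = sample_paths

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        # Load the sample lazily from disk; nothing is cached in RAM
        data, target = torch.load(self.sample_paths[idx])
        return data, target


# Hypothetical per-sample files on disk
paths = [f"data/sample_{i}.pt" for i in range(1_000_000)]
dataset = LazyFileDataset(paths)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for data, target in loader:
    ...  # forward / backward pass as usual
```

With `shuffle=True` each epoch sees batches drawn from the whole dataset, so every batch approximately follows the full data distribution instead of a single subset's.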
