Best practices for training with large datasets?

I’m training a CNN to do statistical downscaling of climate data (similar to super-resolution, but with some important differences), and my dataset consists of netCDF files (both the inputs and the ground truth). Loading all the data from disk to the GPU takes perhaps half a minute with the dataset I’m currently using (I’m not at the computer right now, so I can’t measure it exactly), but the network is fairly small, so each epoch only takes about two seconds. In other words, loading the data takes roughly 15 times as long as one epoch of training.

For a small dataset that fits on the GPU, the loading time is not a problem, since I only need to load it and move it to the GPU once, at the beginning of training.
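For reference, the small-dataset case I’m describing is basically this (file and variable names are just placeholders for my actual data):

```python
import torch
import xarray as xr

# File and variable names below are placeholders for the real netCDF data.
inputs = torch.from_numpy(xr.open_dataset("predictors.nc")["x"].values).float().to("cuda")
targets = torch.from_numpy(xr.open_dataset("targets.nc")["y"].values).float().to("cuda")

# From here on, every epoch just indexes into tensors that already live on the GPU,
# so there is no per-epoch disk I/O or host-to-device copying.
```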

However, the larger dataset I’m planning to use doesn’t fit on the GPU, nor in RAM. A naive solution would be to load a part of the training set (as much as fits at once), train once on each example in that part, discard it, load the next part, and so on until the entire training set has been seen once, and then repeat that every epoch. But this means all the training data would have to be reloaded from disk onto the GPU every single epoch, so over 90% of the time per epoch would be spent loading data and very little on actual training, making each epoch far less time-efficient.

To remedy this slightly, I have come up with two ideas:

  1. The training data could be shuffled (so that consecutive training examples aren’t correlated) and stored on disk in chunks (netCDF files, or npy files written with NumPy’s save function) small enough to fit on the GPU. Each chunk would then be loaded and used for training several times (maybe 10 or 20 passes) before being discarded and replaced by the next chunk. That way, the fraction of each epoch spent on actual training could be bumped from a few percent to maybe 50% or more (a rough sketch of what I mean is below this list). One risk I see with this approach is that the network could overfit slightly to the current chunk, since it is used multiple times before switching to a new one, and the more passes over the same chunk, the less effective each pass becomes. I don’t know how much this affects performance in practice, though.

  2. Multiple GPUs could be used, with the training data split and stored in a distributed fashion across all of them. However, this means the network weights have to be synced between the GPUs in some way (which I’m sure is possible; I just don’t know how). It also means I need access to several GPUs at once, and although I’m on a compute cluster, it is still up to the job scheduler whether I’m granted the number of GPUs I request. Are there any good guides on how to train a neural network on multiple GPUs in PyTorch?
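For idea 1, the training loop I have in mind would look roughly like this (the chunk layout, file names, and the placeholder model are purely illustrative):

```python
import glob

import numpy as np
import torch
from torch import nn, optim

# Placeholder model/optimizer/loss; the real downscaling CNN would go here.
model = nn.Conv2d(1, 1, kernel_size=3, padding=1).to("cuda")
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Pre-shuffled chunks written with np.save (chunks/chunk_000.npy, ...);
# each chunk is assumed to hold stacked (input, target) pairs of 2D fields,
# i.e. an array of shape (samples, 2, height, width).
chunk_paths = sorted(glob.glob("chunks/chunk_*.npy"))
passes_per_chunk = 10   # how many times each chunk is reused before loading the next
batch_size = 32

for epoch in range(100):
    for path in chunk_paths:
        chunk = torch.from_numpy(np.load(path)).float().to("cuda")  # one disk-to-GPU load
        inputs, targets = chunk[:, 0:1], chunk[:, 1:2]               # keep the channel dim
        for _ in range(passes_per_chunk):
            perm = torch.randperm(inputs.shape[0], device="cuda")    # reshuffle within the chunk
            for i in range(0, inputs.shape[0], batch_size):
                idx = perm[i:i + batch_size]
                optimizer.zero_grad()
                loss = criterion(model(inputs[idx]), targets[idx])
                loss.backward()
                optimizer.step()
        del chunk, inputs, targets  # free GPU memory before loading the next chunk
```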

Maybe a combination of these two ideas could be used.

But do these ideas sound reasonable, or are there better ways? Is there any recommended way to go about training with large datasets when using PyTorch? How is it usually tackled?

Being able to store your whole training set on the GPU is a rare luxury.
The most typical scenario is moving the data onto the GPU batch by batch. That’s pretty straightforward, since all the tooling is designed around it.
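A minimal sketch of that batch-wise pattern, assuming your netCDF files can be indexed per sample (the file and variable names here are made up), would look something like this:

```python
import torch
import xarray as xr
from torch.utils.data import DataLoader, Dataset


class DownscalingDataset(Dataset):
    """Reads one (input, target) sample at a time instead of loading everything up front."""

    def __init__(self, input_path, target_path):
        # "x" and "y" are placeholders for the real netCDF variable names.
        self.inputs = xr.open_dataset(input_path)["x"]
        self.targets = xr.open_dataset(target_path)["y"]

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, i):
        x = torch.from_numpy(self.inputs[i].values).float()
        y = torch.from_numpy(self.targets[i].values).float()
        return x, y


dataset = DownscalingDataset("predictors.nc", "targets.nc")
# num_workers > 0 overlaps disk reads with GPU work, but netCDF/HDF5 file handles
# and multiprocessing can need some care (e.g. opening the files inside each worker).
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4, pin_memory=True)

for x, y in loader:
    x = x.to("cuda", non_blocking=True)
    y = y.to("cuda", non_blocking=True)
    # forward / backward pass as usual
```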

It’s true that, in your case, training on multiple GPUs with synchronization could be beneficial in terms of training time. That’s what the DistributedDataParallel module does. Although it is intended to feed data from the CPU rather than have it already on the GPU, maybe there is a workaround for your use case.
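A bare-bones DistributedDataParallel sketch (with dummy tensors standing in for your data, and assuming it is launched with torchrun) looks something like this:

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # Usually launched as: torchrun --nproc_per_node=<num_gpus> train_ddp.py
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy tensors stand in for the real netCDF-backed dataset.
    dataset = TensorDataset(torch.randn(1024, 1, 64, 64), torch.randn(1024, 1, 64, 64))
    sampler = DistributedSampler(dataset)          # each process gets a disjoint shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(nn.Conv2d(1, 1, 3, padding=1).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)                   # reshuffle consistently across processes
        for x, y in loader:
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()                        # gradients are averaged across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```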

Lastly, clusters are usually deterministic: if you ask for 2 GPUs, you will be granted 2. It’s a different matter that the queuing time gets longer as you ask for more resources.

Thank you for your answer! I will check out the DistributedDataParallel module. However, the last time I asked for a compute node exclusively for myself (one node has eight GPUs), I was queued for about 24 hours without being assigned a node, so maybe I shouldn’t rely too heavily on multiple GPUs, even though it would be nice to support using them.

Do you know if my strategy of training several times on one “chunk” of data before switching to a new chunk is used in practice? The strategy makes sense to me, but I haven’t seen any paper or article that mentions it.

Hmm, I don’t know of specific papers about it, but it doesn’t sound like a good idea.

I mean, ideally, if each subset represented the real data distribution it could work… but that’s probably not the case.
Then your model is effectively trained on one chunk’s distribution at a time and keeps jumping between them, which is not a good idea.

Why don’t you just train by loading batches the normal way? I don’t think you’ll lose much time, and you’ll save plenty by not struggling to code a custom approach.