I’m training a CNN to do statistical downscaling of climate data (similar to super-resolution, but with some distinct differences), and my dataset consists of netCDF files (both inputs and ground truth). Loading all data from disk to the GPU takes perhaps half a minute with the dataset I’m currently using (I’m not at the computer now, so I can’t measure it exactly), but the network is fairly small, so each epoch still only takes about two seconds. In other words, loading the data takes roughly 15 times as long as one epoch.
For a small dataset that fits on the GPU, the loading time is not a problem, since I only need to load it and move it to the GPU once, at the beginning of training.
However, for a larger dataset that I’m planning to use, which fits neither on the GPU nor in RAM, a naive solution would be: load a part of the training set (as much as fits at once), train once on each example in that part, discard it, load the next part, and so on until the entire training set has been seen once; then repeat every epoch. This means all training data would have to be reloaded from disk onto the GPU every epoch, so over 90% of the time in each epoch would be spent loading data and very little doing actual training, making each epoch far less time-efficient.
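To make the naive scheme concrete, here is a minimal sketch; `load_chunk` stands in for the real netCDF reading code, and the tiny model and toy reconstruction objective are placeholders, not my actual setup:

```python
import numpy as np
import torch

def load_chunk(i):
    # Placeholder for the real (slow) netCDF read from disk.
    rng = np.random.default_rng(i)
    return torch.from_numpy(rng.standard_normal((64, 8)).astype(np.float32))

model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for epoch in range(2):
    for i in range(3):                 # every chunk reloaded each epoch
        chunk = load_chunk(i)          # slow disk -> GPU transfer dominates
        for batch in chunk.split(16):  # a single pass over the chunk
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(batch), batch)
            loss.backward()
            opt.step()
        del chunk                      # discarded after only one pass
```

The problem is visible in the loop structure: the expensive `load_chunk` call sits inside the per-epoch loop, so the disk cost is paid again every epoch.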
To remedy this slightly, I have come up with two ideas:
- The training data could be shuffled in advance (so that consecutive training examples aren’t correlated) and stored on disk in chunks (netCDF files, or npy files written with NumPy’s save function), each small enough to fit on the GPU. Each chunk could then be loaded and trained on several times (maybe 10 or 20 passes) before being discarded and replaced by the next chunk. That way, the fraction of each epoch spent on actual training could be bumped from a few percent to maybe 50% or more. One risk I see with this method, though, is that the network could slightly overfit the current chunk, since it is trained on repeatedly before switching to a new one, and the more passes you make over the same chunk, the less effective each additional pass (mini epoch?) becomes. I don’t know how much this affects performance in practice, though.
- Multiple GPUs could be used, with the training data split and stored in a distributed fashion across them. However, this means the weights of the network have to be synced between the GPUs in some way (which I’m sure is possible; I just don’t know how). It also means I need access to several GPUs at once, and although I’m on a computer cluster, it is still up to the job scheduler whether I’m granted the number of GPUs I request. Are there any good guides on training a neural network on multiple GPUs in PyTorch?
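The first idea (reusing each pre-shuffled chunk for several passes before loading the next) might look roughly like this; the npy file layout, toy model, and objective are illustrative placeholders:

```python
import glob, os, tempfile
import numpy as np
import torch

# Fake a few pre-shuffled chunks on disk; in reality these would be
# netCDF/npy files produced by an offline shuffling step.
tmpdir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
for i in range(3):
    np.save(os.path.join(tmpdir, f"chunk_{i}.npy"),
            rng.standard_normal((64, 8)).astype(np.float32))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
passes_per_chunk = 10   # the "10 or 20 times" knob from idea 1

for path in sorted(glob.glob(os.path.join(tmpdir, "chunk_*.npy"))):
    chunk = torch.from_numpy(np.load(path)).to(device)  # one disk read...
    for _ in range(passes_per_chunk):                   # ...amortized over many passes
        for batch in chunk.split(16):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(batch), batch)  # toy objective
            loss.backward()
            opt.step()
    del chunk   # free GPU memory before loading the next chunk
```

The single disk read per chunk is now amortized over `passes_per_chunk` training passes, which is where the few-percent-to-50% improvement would come from.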
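For the second idea, the weight syncing is what PyTorch’s DistributedDataParallel (DDP) does: each process keeps a replica of the model and gradients are all-reduced during backward. A minimal sketch, faked down to a single CPU process with the gloo backend so it runs anywhere (a real run would launch one process per GPU with `torchrun` and use the nccl backend):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# In practice you launch with `torchrun --nproc_per_node=N train.py`,
# which sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT for you; here we
# fake a one-process "cluster" so the sketch runs on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 1))   # DDP all-reduces gradients between ranks
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Each rank would train on its own shard of the data (e.g. selected with
# torch.utils.data.distributed.DistributedSampler).
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                      # gradient sync happens here
opt.step()
dist.destroy_process_group()
```

Since all ranks start from the same weights and apply the same averaged gradients, the replicas stay in sync without any explicit weight exchange.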
Maybe a combination of these two ideas could be used.
But do these ideas sound reasonable, or are there better ways? Is there any recommended way to go about training with large datasets when using PyTorch? How is it usually tackled?