If I have 4 GPUs, is it possible to do the following:
Use one GPU to store the intermediate outputs and aggregate them for backprop, while mounting the model and mini-batches on the remaining 3 GPUs to compute the forward pass.
In this case, I’d be able to train a larger model with a larger batch size. Currently, this is bounded by single-GPU memory, since PyTorch always gathers all outputs onto one GPU, which is also responsible for the forward() computation.
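One commonly suggested workaround for the gather-induced imbalance is to compute the loss inside forward(), so `nn.DataParallel` gathers only small per-replica loss values onto the default GPU instead of the full output tensors. A minimal sketch, where `FullModel` and all layer sizes are illustrative placeholders, not anything from this thread:

```python
import torch
import torch.nn as nn

class FullModel(nn.Module):
    """Wrap model + criterion so the loss is computed on each replica."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, x, target):
        out = self.model(x)
        # Only this (scalar) loss is gathered back to the default GPU,
        # not the full `out` tensor.
        return self.criterion(out, target)

model = FullModel(nn.Linear(128, 10), nn.CrossEntropyLoss())
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across available GPUs

x = torch.randn(32, 128)
target = torch.randint(0, 10, (32,))
loss = model(x, target)
loss.mean().backward()  # mean() in case DataParallel returns one loss per replica
```

This reduces the memory spike on the default device, though the gradients are still reduced onto a single GPU during backward.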
This use case sounds like it would require a lot of synchronization, which would most likely make the whole training procedure slow. Sorry, but I’m not sure this is possible at all.
However, you could have a look at torch.utils.checkpoint and see if it helps you save some memory.
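For reference, a minimal sketch of torch.utils.checkpoint: activations inside a checkpointed block are not stored during forward and are recomputed during backward, trading compute for memory. The model here (`block1`, `block2`, sizes) is a made-up example, not code from this thread:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.head = nn.Linear(128, 10)

    def forward(self, x):
        # Activations inside the checkpointed blocks are freed after
        # forward and recomputed on-the-fly during backward.
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return self.head(x)

model = Net()
# The checkpointed input should require grad, otherwise the
# recomputation during backward has nothing to differentiate through.
x = torch.randn(4, 128, requires_grad=True)
out = model(x)
out.sum().backward()
```

The saving grows with the depth of the checkpointed segments, at the cost of roughly one extra forward pass per segment during backward.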
@ptrblck, thanks for your suggestion! This is a great way to trade compute speed for memory. However, this method applies equally to single-GPU and multi-GPU training: the memory pressure is alleviated in both, but the imbalanced memory usage across multiple GPUs still exists, so the limit is still set by a single GPU’s memory.
So far, I’ve had a better experience with single-GPU training than with multi-GPU training. For example, comparing 1 GPU + bsz=20 vs. 2 GPUs + bsz=40 vs. 4 GPUs + bsz=80, I observe only about a 13% throughput improvement each time the number of GPUs doubles. At the same time, convergence slows down, since with more GPUs and a larger batch size the total number of updates in a fixed training time decreases.
In my case, the data is huge, and I only care about the accuracy after a certain number of epochs (1–5). Although a larger batch size could eventually yield better accuracy, I’m far from that point. Therefore, single-GPU training is more practical for me at present, unless there is an elegant solution.
I wonder how other deep learning frameworks handle this issue?