During inference/prediction (not training) I can run the model on the data. When training, however, it throws a CUDA out-of-memory error, even with batch size = 1. I suspect this happens because training additionally stores the compute graph (the intermediate activations needed for the backward pass)?
I have 4 GPUs available to me and I want to use all 4 instead of only 1. Can I somehow pool the memory of all GPUs together?
As I understand it, if I use DataParallel or DistributedDataParallel, this essentially replicates the full model on each GPU. So in my case I would have 4 copies of the model running in parallel. But my workload already does not fit on 1 GPU even with batch size 1, so I think DataParallel is not the solution here, right?
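Just to check my understanding of data parallelism, this is what I mean (a minimal sketch; `MyModel` is a stand-in for my actual network, and `nn.DataParallel` simply runs the module directly when no GPUs are visible):

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Stand-in for my real network."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)

model = MyModel()
# DataParallel replicates the *whole* model on every visible GPU, and
# splits the batch across the replicas. Each GPU still has to hold a
# full copy of the model plus its activations -- memory is not pooled.
dp_model = nn.DataParallel(model)
out = dp_model(torch.randn(4, 8))
print(out.shape)  # torch.Size([4, 2])
```

So with batch size = 1 this buys me nothing memory-wise, as far as I can tell.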
I saw that I can also use model parallelism: break the model into chunks and have each GPU hold and process a separate chunk. If the compute graph was indeed the issue before, then this should solve it, right?
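Concretely, I'm imagining something like this (a minimal two-device sketch; the layer sizes and the `TwoStageModel` name are just placeholders, and it falls back to CPU when fewer GPUs are available so the snippet runs anywhere):

```python
import torch
import torch.nn as nn

# With 4 GPUs these would be cuda:0 .. cuda:3; fall back to CPU
# for any device index that is not available.
n = torch.cuda.device_count()
dev = [torch.device(f"cuda:{i}") if i < n else torch.device("cpu")
       for i in range(2)]

class TwoStageModel(nn.Module):
    """Model parallelism: each chunk of layers lives on its own device."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(8, 16), nn.ReLU()).to(dev[0])
        self.stage2 = nn.Linear(16, 2).to(dev[1])

    def forward(self, x):
        x = self.stage1(x.to(dev[0]))
        # Move the intermediate activations to the next chunk's device.
        return self.stage2(x.to(dev[1]))

model = TwoStageModel()
out = model(torch.randn(4, 8))
loss = out.sum()
loss.backward()  # autograd follows the graph across devices
print(out.shape)  # torch.Size([4, 2])
```

This way each GPU would only need to store its own chunk's parameters and activations, which is what I mean by "pooling" the memory.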
What is the best way to go about dealing with a model/workload that does not fit on 1 GPU?