Model parallelism vs. data parallelism: training does not fit on one GPU (batch_size=1)

During inference/prediction (not training) I can run the model on the data. When training, however, it throws a CUDA out-of-memory error even with a batch size of 1. I suspect this happens because of the compute graph that is built during training?
I have 4 GPUs available to me and I want to use all 4 instead of only 1. Can I somehow pool the memory of all GPUs together?

As I understand it, DataParallel or DistributedDataParallel essentially replicates the model on each GPU, so in my case I would have 4 copies of the model running in parallel. But my data does not fit on 1 GPU even at batch size 1, so I think DataParallel is not the solution here, right?
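For reference, this is how I understand the DataParallel usage (the toy model and sizes below are made up, just to illustrate the replication):

```python
import torch
import torch.nn as nn

# Hypothetical toy model, only to illustrate the API.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# nn.DataParallel replicates the *whole* model onto every listed GPU and
# scatters each batch along dim 0, so every GPU still has to hold a full
# copy of the model plus its share of the activations.
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()

x = torch.randn(8, 1024).cuda()   # 8 samples -> 2 per GPU
out = model(x)                    # outputs are gathered back on cuda:0
```

With batch_size=1 there is only a single sample to scatter, so effectively just one replica would do any work and the per-GPU memory requirement would not change.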

I saw that I can also use model parallelism by breaking the model into chunks and having each GPU hold and process a separate chunk. If the compute graph was indeed the issue before, then this should solve it, since each GPU only has to store the activations of its own chunk?
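Something like this manual split across two GPUs is what I have in mind (the layers and sizes are made up):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model parallelism: the first half lives on cuda:0, the second
    half on cuda:1, so each GPU only stores the parameters and activations
    of its own chunk."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(2048, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # move the intermediate activation to the second device
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(1, 1024))   # batch_size = 1 still works
out.sum().backward()                # autograd handles the cross-device hops
```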

What is the best way to go about dealing with data that does not fit on 1 GPU?

You could check out tau as an approach to pipeline parallelism, or e.g. torch.utils.checkpoint to trade compute for memory.
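A minimal checkpointing sketch (the toy model, depth, and segment count are just placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy deep network; the real model and sizes are placeholders.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(20)]
).cuda()

x = torch.randn(1, 1024, device="cuda", requires_grad=True)

# Run the forward pass in 4 segments: only the activations at the segment
# boundaries are kept, everything in between is recomputed in the backward
# pass, which trades extra compute for a smaller memory footprint.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```

Each iteration gets slower because of the recomputation, but the peak activation memory drops since only the segment boundaries are cached.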

Oh I haven’t encountered pipeline parallelism before. Is this conceptually a combination of data and model parallelism?

I would see it as a model-parallel technique, since the actual model is split across different devices while the data stays untouched.
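Conceptually it would look roughly like this hand-rolled sketch (the two-way split, layer sizes, and micro-batch count are made up; a real pipeline library schedules this for you): the batch is cut into micro-batches so the two model chunks can work on different micro-batches at the same time.

```python
import torch
import torch.nn as nn

# Two model chunks on two GPUs, as in plain model parallelism.
part1 = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU()).to("cuda:0")
part2 = nn.Sequential(nn.Linear(2048, 10)).to("cuda:1")

def pipelined_forward(x, micro_batches=4):
    # Split the batch into micro-batches; while cuda:1 processes
    # micro-batch i, cuda:0 can already start on micro-batch i+1,
    # since CUDA kernels are launched asynchronously.
    splits = iter(x.split(x.size(0) // micro_batches, dim=0))
    s_prev = part1(next(splits).to("cuda:0")).to("cuda:1")
    outputs = []
    for s_next in splits:
        outputs.append(part2(s_prev))                      # stage 2 on cuda:1
        s_prev = part1(s_next.to("cuda:0")).to("cuda:1")   # stage 1 on cuda:0
    outputs.append(part2(s_prev))
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(8, 1024))
```

With a batch size of 1 there is nothing to split into micro-batches, so this degenerates to plain model parallelism; the memory relief still comes from each GPU holding only its own chunk of the model.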