Is it possible to split all large models and perform training in parallel over multiple GPUs?

If so, what is the point of creating/using GPUs with larger VRAM?

If it’s because a GPU with larger VRAM is faster, how much faster is a GPU with 80GB VRAM than a dual GPU setup of the same GPU with 40GB VRAM?

I am asking because I have often read that "if you want to train larger models, you need a GPU with more VRAM." But if we can split the model and train it across multiple GPUs in parallel, or even on a single GPU with less VRAM, doesn't that mean the claim isn't entirely accurate?

Your question has many moving parts that need to be pinned down before it can be answered, such as:
the ratio of FLOPs to memory bandwidth of each GPU,
the interconnect speed between each GPU (e.g., is this PCI-E or NVLink, and is there P2P support, …),
the model sharding strategy (TP, PP, DDP, FSDP, or some combination of them?),
the number of GPUs being compared.
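To see why sharding strategy interacts with VRAM, here is a back-of-envelope sketch (pure Python). It assumes a common rule of thumb of roughly 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights and optimizer moments), and ignores activations; the numbers are illustrative, not exact:

```python
def training_memory_gb(n_params: float, n_gpus: int, sharded: bool) -> float:
    """Rough per-GPU memory estimate for mixed-precision Adam training.

    Assumed bytes per parameter (a rule of thumb, not exact):
      2 (fp16 weights) + 2 (fp16 grads)
      + 12 (fp32 master weights + Adam momentum + variance)
      = 16 bytes/param, activations ignored.
    """
    bytes_per_param = 16
    total_bytes = n_params * bytes_per_param
    if sharded:
        # FSDP / ZeRO-3 style: weights, grads, and optimizer
        # states are partitioned across GPUs.
        total_bytes /= n_gpus
    return total_bytes / 1e9  # GB per GPU

# A hypothetical 7B-parameter model on 2 GPUs:
print(training_memory_gb(7e9, 2, sharded=False))  # plain DDP: fully replicated
print(training_memory_gb(7e9, 2, sharded=True))   # fully sharded
```

Under these assumptions, plain data parallelism replicates everything (so 2×40GB buys you no larger a model than 1×40GB), while full sharding roughly halves the per-GPU footprint at the cost of extra weight communication.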

Intuitively, whenever you use more GPUs, some overhead is introduced that would be reduced or nonexistent in the single-GPU case: bubbles in the pipeline if pipeline parallelism is used, activation communication if tensor/model parallelism is used, gradient all-reduce communication if data parallelism is used, and weight communication if fully sharded data parallelism is used. Whether the additional FLOPs from adding GPUs make up for this overhead depends on the communication required by your use case and on the GPU's communication speed relative to its computational throughput. This recent paper shows how complex choosing the best parallelization strategy for a given number of GPUs can be: arXiv:2304.11277.
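The pipeline-bubble overhead mentioned above can be quantified with the standard GPipe-style estimate: with `p` pipeline stages and `m` micro-batches, the idle ("bubble") fraction is `(p - 1) / (m + p - 1)`. A minimal sketch:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline schedule:
    (p - 1) / (m + p - 1), where p = pipeline stages, m = micro-batches."""
    return (stages - 1) / (microbatches + stages - 1)

# With 2 GPUs as pipeline stages, more micro-batches shrink the bubble:
for m in (1, 4, 16):
    print(m, bubble_fraction(2, m))
```

So a naive 2-stage split with one micro-batch wastes half the compute, while 16 micro-batches cut the bubble to under 6% (before counting communication costs).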


I really liked your detailed answer!
I asked this question with a pair of 4090s in mind. So the communication method between the GPUs would be PCIe, with no P2P support, based on the other topic I read about multi-4090 builds.
The TFLOPS-to-memory-bandwidth ratio is 2.6, I think.
I don't know about the sharding strategy.

I just wanted to ask a general question: is a dual-4090 build somehow a waste of money (compared to dual 3090s or a single 4090), or is it worth it?