Inference on multiple GPUs

Hi, I have a sizeable pre-trained model and I want to run inference with it on multiple GPUs (I don’t want to train it). Is there a way to do that?
In short, I want model parallelism. If there is a way, how is it done?

@Milad_Yazdani There are multiple options, depending on the type of model parallelism you want. There is PyTorch FSDP (FullyShardedDataParallel, see the PyTorch 1.11.0 documentation), which is ZeRO-3 style sharding for large models. There is very recent tensor parallelism support (see the examples/distributed/sharded_tensor example in the pytorch/examples repository on GitHub), and there is pipeline parallelism support too, in pytorch/PiPPy (Pipeline Parallelism for PyTorch) on GitHub.
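For orientation, the simplest form of model parallelism for inference is to place different layers on different GPUs by hand and move the activations between them. The following is only a minimal sketch of that idea, not FSDP, tensor parallelism, or PiPPy; the class name `TwoDeviceModel`, the helper `pick_devices`, and the layer sizes are made up for illustration, and the code falls back to CPU when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


def pick_devices():
    # Hypothetical helper: use two GPUs when available,
    # otherwise fall back to CPU so the sketch still runs.
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


class TwoDeviceModel(nn.Module):
    """Toy model whose first half lives on dev0 and second half on dev1."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.first = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)
        self.second = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        # Move the input to the first device, then hand the
        # intermediate activation over to the second device.
        h = self.first(x.to(self.dev0))
        return self.second(h.to(self.dev1))


dev0, dev1 = pick_devices()
model = TwoDeviceModel(dev0, dev1).eval()

# Inference only: no gradients are needed.
with torch.no_grad():
    out = model(torch.randn(8, 16))
```

With this naive split only one GPU is busy at any moment, which is the gap the pipeline and tensor parallel approaches try to close.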

If you could share more details about your model and setup, we can help suggest what might be the best fit here:

  1. How big is the model (number of parameters) and how many GPUs do you want to use?
  2. Do you want to split the model across multiple GPUs on a single host or is the model large enough that it needs to be split across multiple hosts?
  3. Since this is GPU inference, I’m assuming you want to optimize for latency?
cc @mrshenli regarding distributed inference.

PiPPy (Pipeline Parallelism for PyTorch) supports distributed inference.

PiPPy can split pre-trained models into pipeline stages and distribute them onto multiple GPUs or even multiple hosts. It also supports distributed, per-stage materialization if the model does not fit in the memory of a single GPU. Note that you only get actual pipeline parallelism when you have multiple microbatches to run inference on, because that is what lets different stages work on different microbatches at the same time.
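To see why multiple microbatches matter, here is a small pure-Python sketch of an idealized forward-only pipeline schedule. It is not PiPPy's scheduler; the function names `pipeline_schedule` and `utilization` are hypothetical, and it assumes every stage takes one equal time step per microbatch.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return a table: schedule[t][s] is the microbatch index that
    stage s processes at time step t, or None if the stage is idle."""
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # microbatch mb reaches stage s at step mb + s
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule


def utilization(schedule):
    """Fraction of (time step, stage) slots that are doing work."""
    busy = sum(slot is not None for row in schedule for slot in row)
    return busy / (len(schedule) * len(schedule[0]))


single = pipeline_schedule(num_stages=4, num_microbatches=1)
many = pipeline_schedule(num_stages=4, num_microbatches=8)

print(utilization(single))  # one microbatch: only one stage busy per step
print(utilization(many))    # more microbatches: stages overlap
```

Under these assumptions, a single microbatch keeps each of the 4 stages busy a quarter of the time, while 8 microbatches finish in 4 + 8 - 1 = 11 steps with the stages overlapping for most of them.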

Here is an inference example for large T5 models using PiPPy, with a detailed user guide: