Inference on multiple GPUs

Hi, I have a sizeable pre-trained model and I want to run inference with it on multiple GPUs (I don’t want to train it). Is there any way to do that?
In short, I want model parallelism. If there is a way, how is it done?

@Milad_Yazdani There are multiple options, depending on the type of model parallelism you want. There is PyTorch FSDP (FullyShardedDataParallel, see the PyTorch 1.11.0 documentation), which is ZeRO-3 style sharding for large models. There is also very recent tensor parallelism support (see the examples/distributed/sharded_tensor example in the pytorch/examples repository on GitHub), and there is pipeline parallelism support too, in pytorch/PiPPy (Pipeline Parallelism for PyTorch) on GitHub.
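
For reference, the simplest form of model parallelism for inference is to place different layers on different GPUs by hand and move the activations between them. This is not FSDP, tensor parallelism, or PiPPy, just a minimal sketch of the basic idea. The `TwoStageModel` class, its layer sizes, and the `pick_devices` helper are made up for illustration, and the code falls back to CPU when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


def pick_devices():
    # Use two GPUs when available; otherwise fall back to CPU so the sketch still runs.
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


class TwoStageModel(nn.Module):
    """Hypothetical model split into two stages, each living on its own device."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)
        self.stage1 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        # Run the first stage on dev0, then copy the activations to dev1.
        h = self.stage0(x.to(self.dev0))
        return self.stage1(h.to(self.dev1))


dev0, dev1 = pick_devices()
model = TwoStageModel(dev0, dev1).eval()  # eval mode: inference only

with torch.no_grad():  # no autograd bookkeeping is needed for inference
    out = model(torch.randn(8, 16))
```

With this layout only one GPU is busy at a time for a given batch, so it helps the model fit in memory but does not by itself reduce latency. Pipeline and tensor parallelism exist to address that.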

If you could share more details about your model and setup, we can help propose what might be the best fit here:

  1. How big is the model (number of parameters) and how many GPUs do you want to use?
  2. Do you want to split the model across multiple GPUs on a single host or is the model large enough that it needs to be split across multiple hosts?
  3. Since this is GPU inference, I’m assuming you want to optimize for latency?
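
As a rough way to answer the first two questions, the memory needed for the weights alone can be estimated from the parameter count and the dtype. The sketch below is only back-of-the-envelope arithmetic: the function names, the 13B-parameter model, the 16 GiB GPU size, and the 0.8 usable-memory fraction are all assumptions chosen for illustration, and real usage also depends on activations, batch size, and framework overhead.

```python
import math

# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_gib(num_params, dtype="fp16"):
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def min_gpus(num_params, gpu_mem_gib, dtype="fp16", usable_fraction=0.8):
    """Smallest GPU count whose combined usable memory holds the weights.

    usable_fraction is an assumed headroom for activations and buffers,
    not a measured value.
    """
    usable = gpu_mem_gib * usable_fraction
    return max(1, math.ceil(weight_gib(num_params, dtype) / usable))


# Hypothetical example: a 13B-parameter model in fp16 on 16 GiB GPUs.
print(round(weight_gib(13e9), 1))
print(min_gpus(13e9, gpu_mem_gib=16))
```

If the resulting count is larger than the number of GPUs in one machine, the model would likely need to be split across multiple hosts.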

cc @mrshenli regarding distributed inference.