Hi, I have a sizeable pre-trained model and I want to run inference on multiple GPUs (I don't want to train it). Is there any way to do that?
In summary, I want model parallelism. If there is a way, how is it done?
@Milad_Yazdani There are multiple options depending on the type of model parallelism you want. There is PyTorch FSDP: FullyShardedDataParallel — PyTorch 1.11.0 documentation, which is ZeRO3-style sharding for large models. There is very recent Tensor Parallelism support (see this example: examples/distributed/sharded_tensor at main · pytorch/examples · GitHub), and there is pipeline parallelism support as well: GitHub - pytorch/PiPPy: Pipeline Parallelism for PyTorch
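For reference, here is a minimal sketch of FSDP used purely for inference on a single host. It assumes a launch with `torchrun --nproc_per_node=<num_gpus>`, and the small `nn.Sequential` stands in for your own pre-trained model (the loading step is just a placeholder comment):

```python
# Minimal sketch: shard a pre-trained model with FSDP and run a forward pass.
# Launch with: torchrun --nproc_per_node=2 fsdp_inference.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Stand-in for your pre-trained model; load your real weights here instead.
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 10),
    ).cuda()

    # Wrap the model so its parameters are sharded across all ranks.
    fsdp_model = FSDP(model)
    fsdp_model.eval()

    with torch.no_grad():
        batch = torch.randn(8, 512, device=f"cuda:{rank}")  # dummy input
        out = fsdp_model(batch)
        print(rank, out.shape)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```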
If you could share more details about your model and setup, we can help propose what might be the best fit here:
- How big is the model (number of parameters) and how many GPUs do you want to use?
- Do you want to split the model across multiple GPUs on a single host or is the model large enough that it needs to be split across multiple hosts?
- Since this is GPU inference, I’m assuming you want to optimize for latency?
cc @mrshenli regarding distributed inference.
PiPPy (Pipeline Parallelism for PyTorch) supports distributed inference.
PiPPy can split pre-trained models into pipeline stages and distribute them onto multiple GPUs or even multiple hosts. It also supports distributed, per-stage materialization if the model does not fit in the memory of a single GPU. When there are multiple microbatches to run inference on, the stages execute concurrently, which is what achieves pipeline parallelism.
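To make the splitting idea concrete, here is a deliberately simplified manual version of it in plain PyTorch. This is not PiPPy's API; it just places two halves of a model on different GPUs and moves activations between them during the forward pass. PiPPy automates the splitting, per-stage materialization, and microbatch scheduling on top of this basic idea:

```python
# Manual illustration of pipeline-style model splitting (not PiPPy's API):
# stage 0 lives on cuda:0, stage 1 on cuda:1, and activations are moved
# between the two devices inside forward().
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        x = self.stage1(x.to("cuda:1"))  # activation transfer between GPUs
        return x

model = TwoStageModel().eval()
with torch.no_grad():
    out = model(torch.randn(8, 512))
print(out.shape)  # torch.Size([8, 10])
```

In this naive form only one GPU is busy at a time; PiPPy additionally splits the input into microbatches and overlaps the stages so the GPUs work concurrently.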
Here is an inference example for large T5 models using PiPPy, with a detailed user guide: