How to Perform Multiple Inferences on Multiple GPUs in PyTorch?

Hi,

I need to perform inference with the same model on multiple GPUs inside a Docker container. The idea is to run inference on multiple input batches simultaneously, so that each GPU processes a different batch in parallel using its own replica of the model.

However, the GPUs do not appear to be fully utilized: while one GPU is processing, the other sits idle. I would like both GPUs to be used simultaneously for inference, with the workload distributed effectively.

To clarify, my goal is to:

  • Replicate the same model across multiple GPUs.
  • Distribute input batches across GPUs in parallel.
  • Ensure both GPUs are actively processing during inference, with neither sitting idle.

Any help or guidance on how to best set this up using PyTorch inside Docker would be greatly appreciated!

You could try implementing a custom multiprocessing solution or use an inference serving framework such as TorchServe or Triton Inference Server.
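
If you go the custom route, here is a minimal sketch of the per-process pattern: one worker per GPU, each building its own model replica and consuming its own slice of the input. Note that `build_model` and `per_gpu_batches` are hypothetical placeholders for your own model constructor and your pre-split input data, not part of any library API.

```python
import torch
import torch.multiprocessing as mp

def inference_worker(rank, model_fn, per_gpu_batches):
    # Each worker process pins itself to one GPU and builds its own model replica.
    device = torch.device(f"cuda:{rank}")
    model = model_fn().to(device).eval()
    with torch.no_grad():
        for batch in per_gpu_batches[rank]:
            out = model(batch.to(device, non_blocking=True))
            # ... handle `out` here (move to CPU, push to a queue, write to disk, ...)

if __name__ == "__main__":
    # `build_model` and `per_gpu_batches` are placeholders: a picklable, module-level
    # function that constructs your model, and the input batches pre-split per GPU.
    num_gpus = torch.cuda.device_count()
    mp.spawn(inference_worker,
             args=(build_model, per_gpu_batches),
             nprocs=num_gpus,
             join=True)
```

`torch.multiprocessing.spawn` launches all workers up front and only then waits for them, so both GPUs should be busy at the same time as long as each worker stays on its own device.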

Hi,

I am already using torch.multiprocessing for parallel inference, but I still notice that when one GPU is performing inference, the other remains idle. My goal is to have both GPUs running inference simultaneously on different batches. Is there something I might be missing in my setup?
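
One thing worth checking (just a guess, since the actual setup isn't shown): if each worker process is joined immediately after it is started, the GPUs end up running one after another rather than in parallel. A small sketch of the difference, assuming a hypothetical module-level `run_on_gpu(rank)` worker function and the `spawn` start method (required when using CUDA in subprocesses):

```python
import torch.multiprocessing as mp

# Serialized by accident: GPU 1 only starts once GPU 0 has finished,
# because join() blocks before the next worker is launched.
for rank in range(2):
    p = mp.Process(target=run_on_gpu, args=(rank,))
    p.start()
    p.join()

# Truly parallel: launch all workers first, then wait for all of them.
procs = [mp.Process(target=run_on_gpu, args=(rank,)) for rank in range(2)]
for p in procs:
    p.start()
for p in procs:
    p.join()
```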