How to Perform Multiple Inferences on Multiple GPUs in PyTorch?

Hi,

I need to perform inference with the same model on multiple GPUs inside a Docker container. The idea is to run inference on multiple input batches simultaneously, so that each GPU processes a different batch in parallel using its own replica of the model.

However, the GPUs do not appear to be fully utilized: while one GPU is processing, the other sits idle. I would like both GPUs to be used simultaneously for inference, with the workload distributed effectively.

To clarify, my goal is to:

  • Replicate the same model across multiple GPUs.
  • Distribute input batches across GPUs in parallel.
  • Ensure both GPUs are actively processing during inference, with neither sitting idle.

Any help or guidance on how to best set this up using PyTorch inside Docker would be greatly appreciated!

You could try implementing a custom multiprocessing solution or use an inference serving framework such as TorchServe or Triton Inference Server.
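
If you go the custom route, here is a minimal sketch of the per-process pattern: one worker per GPU, each building its own model replica and consuming its own slice of the input. Note that `build_model` and `per_gpu_batches` are hypothetical placeholders for your own model constructor and your pre-split input data, not part of any library API.

```python
import torch
import torch.multiprocessing as mp

def inference_worker(rank, model_fn, per_gpu_batches):
    # Each worker process pins itself to one GPU and builds its own model replica.
    device = torch.device(f"cuda:{rank}")
    model = model_fn().to(device).eval()
    with torch.no_grad():
        for batch in per_gpu_batches[rank]:
            out = model(batch.to(device, non_blocking=True))
            # ... handle `out` here (move to CPU, push to a queue, write to disk, ...)

if __name__ == "__main__":
    # `build_model` and `per_gpu_batches` are placeholders: a picklable, module-level
    # function that constructs your model, and the input batches pre-split per GPU.
    num_gpus = torch.cuda.device_count()
    mp.spawn(inference_worker,
             args=(build_model, per_gpu_batches),
             nprocs=num_gpus,
             join=True)
```

`torch.multiprocessing.spawn` launches all workers up front and only then waits for them, so both GPUs should be busy at the same time as long as each worker stays on its own device.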

Hi,

I am already using torch.multiprocessing for parallel inference, but I still notice that when one GPU is performing inference, the other remains idle. My goal is to have both GPUs running inference simultaneously on different batches. Is there something I might be missing in my setup?
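
One thing worth checking (just a guess, since the actual setup isn't shown): if each worker process is joined immediately after it is started, the GPUs end up running one after another rather than in parallel. A small sketch of the difference, assuming a hypothetical module-level `run_on_gpu(rank)` worker function and the `spawn` start method (required when using CUDA in subprocesses):

```python
import torch.multiprocessing as mp

# Serialized by accident: GPU 1 only starts once GPU 0 has finished,
# because join() blocks before the next worker is launched.
for rank in range(2):
    p = mp.Process(target=run_on_gpu, args=(rank,))
    p.start()
    p.join()

# Truly parallel: launch all workers first, then wait for all of them.
procs = [mp.Process(target=run_on_gpu, args=(rank,)) for rank in range(2)]
for p in procs:
    p.start()
for p in procs:
    p.join()
```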