Running Inference on multiple GPUs

I have a model that accepts two inputs. I want to run inference on multiple GPUs where one of the inputs is fixed, while the other changes. So, let’s say I use n GPUs, each of which has a copy of the model. The first GPU processes the input pair (a_1, b), the second processes (a_2, b), and so on. All the outputs are saved as files, so I don’t need to do a join operation on the outputs. How can I do this with DDP or otherwise?

You wouldn’t need to use DDP but could directly execute the forward passes on the models located on the different GPUs.
However, you would need to check if your overall workload is CPU-limited, as I assume both executions should run in parallel. DDP avoids running into the GIL by using multiple processes (you could do the same). You could also try CUDA Graphs, which reduce the CPU overhead and could allow the CPU to run ahead and schedule the execution of both models without falling behind.
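
Just as a rough sketch of the "no DDP" idea (the two-input model, tensor shapes, and file names here are placeholders, since I don't know your actual setup): one model copy per GPU, with the forward passes launched one after another from a single process so the kernels on the different devices can overlap.

```python
import torch
import torch.nn as nn

num_gpus = torch.cuda.device_count()

# Hypothetical two-input model standing in for the real one.
class TwoInputModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(20, 5)

    def forward(self, a, b):
        return self.fc(torch.cat([a, b], dim=1))

# One copy of the model per GPU.
models = [TwoInputModel().to(f"cuda:{i}").eval() for i in range(num_gpus)]

b = torch.randn(8, 10)                                   # the fixed input
a_list = [torch.randn(8, 10) for _ in range(num_gpus)]   # one varying input per GPU

with torch.no_grad():
    outputs = []
    for i, model in enumerate(models):
        device = f"cuda:{i}"
        # Kernel launches are asynchronous, so the work on the different
        # devices can overlap as long as the CPU keeps up.
        outputs.append(model(a_list[i].to(device), b.to(device)))

# Save each output to its own file; no join is needed.
for i, out in enumerate(outputs):
    torch.save(out.cpu(), f"output_{i}.pt")
```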

Thank you for your reply. I am currently planning to use mp.spawn to launch a function that accepts the rank of the GPU and prepares the input based on the rank (the input would be (a_rank, b)). Am I thinking in the right direction? If yes, are there any caveats I should be on the lookout for?
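
Roughly, I am imagining something like the following (the model and shapes are placeholders, and the fixed seed is just one way to keep b identical in every process):

```python
import torch
import torch.multiprocessing as mp

def run_inference(rank):
    device = torch.device(f"cuda:{rank}")

    torch.manual_seed(42)                             # same seed in every process
    model = torch.nn.Linear(20, 5).to(device).eval()  # placeholder for the real model
    b = torch.randn(8, 10).to(device)                 # fixed input, identical everywhere
    a = (torch.randn(8, 10) + rank).to(device)        # stands in for a_rank

    with torch.no_grad():
        out = model(torch.cat([a, b], dim=1))
    torch.save(out.cpu(), f"output_rank{rank}.pt")

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_inference, nprocs=world_size, join=True)
```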

Yes, I think this approach sounds valid. Just to make sure I understand the use case: you are not planning to communicate between these two models at all to synchronize their parameters etc.?
If so, a very simple approach could just be to run the workload in two different terminals, specifying the desired GPU either directly in the script or via CUDA_VISIBLE_DEVICES.

You are right, there is no communication between the models. However, part of their inputs has to match, which is why I want to do it from a single terminal. I could potentially set all the random seeds and ensure that the inputs are always prepared in the same way across multiple parallel terminal runs; however, I want to learn how to use multiprocessing and the other distributed features of PyTorch, so I’d rather do it this way.

Yeah, that makes sense and is a good argument for your approach. Seeding could work, but I would consider your multiprocessing approach more robust, since your data comes from a single source.

Hi again, I am stuck on how to pass the input to the spawned processes. All the tutorials I see have each of the spawned processes prepare its own input using a DistributedSampler, but I don’t think that works for me here. Instead, I want the main process to prepare the inputs and pass them to the spawned processes. How can I do this? Thank you in advance.

You could check `torch.multiprocessing` as described in these docs and share the tensors with all processes through it.
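
A rough sketch of what that could look like (again with a placeholder model, shapes, and file names): the main process builds b and all the a_i once and hands them to the workers via mp.spawn’s args.

```python
import torch
import torch.multiprocessing as mp

def worker(rank, model_state, a_list, b):
    device = torch.device(f"cuda:{rank}")
    model = torch.nn.Linear(20, 5).to(device).eval()  # placeholder for the real model
    model.load_state_dict(model_state)                # same weights in every process

    a = a_list[rank].to(device)   # rank-specific input prepared by the main process
    b = b.to(device)              # fixed input shared by all ranks

    with torch.no_grad():
        out = model(torch.cat([a, b], dim=1))
    torch.save(out.cpu(), f"output_rank{rank}.pt")

if __name__ == "__main__":
    world_size = torch.cuda.device_count()

    # Prepare the model weights and all inputs once in the main process.
    model_state = torch.nn.Linear(20, 5).state_dict()
    b = torch.randn(8, 10)
    a_list = [torch.randn(8, 10) for _ in range(world_size)]

    # Move the CPU tensors into shared memory so the spawned processes
    # can access them without extra copies.
    b.share_memory_()
    for a in a_list:
        a.share_memory_()

    mp.spawn(worker, args=(model_state, a_list, b), nprocs=world_size, join=True)
```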

Thank you, that was helpful.

In case anyone wants to do something similar, here is a gist I made with a minimalistic example of running inference on multiple GPUs: Multi GPU inference using `torch.multiprocessing` · GitHub