Running prediction on multiple GPUs in parallel

Hey guys,

I have a trained 3D-CNN model. Running a single example takes almost all of one GPU's memory.

So I want to use multiple GPUs to run the predictions at the same time.

Here’s my code:

import torch
from make_my_model import make_my_model  # assuming make_my_model is a factory function in make_my_model.py

# placeholder sizes for the 3D input volume (channels, height, width, length)
c, h, w, l = 1, 64, 64, 64

device0 = torch.device('cuda:0')
device1 = torch.device('cuda:1')

dummy_input_0 = torch.rand(1, c, h, w, l, device=device0)
dummy_input_1 = torch.rand(1, c, h, w, l, device=device1)

with torch.no_grad():
    # one model replica per GPU, switched to eval mode for prediction
    model_0 = make_my_model().to(device0).eval()
    model_1 = make_my_model().to(device1).eval()

    out_0 = model_0(dummy_input_0)
    out_1 = model_1(dummy_input_1)

This code gives the right output, but it doesn't seem to run asynchronously.

What is the proper way to run predictions in parallel on two (or maybe more) GPUs?

Thanks, guys!

I would guess you might not see parallel execution since your CPU might not be able to schedule the kernels fast enough and/or you might have (unwanted) synchronizations in your code.
In the default “eager” mode the CPU needs to dispatch each call to the internal operator implementation, which then launches the corresponding CUDA kernel.
If your model has e.g. 100 layers (and for simplicity let’s claim each layer calls into a single CUDA kernel), the CPU would need to schedule 100 kernels for model_0 before it can start scheduling the workload for model_1.
Assuming the CPU is fast enough and you are not blocking it with synchronizations, you should see an overlap and could use a profiler (e.g. the native PyTorch profiler or Nsight Systems) to check it.
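
For a quick check, here is a minimal sketch using the native PyTorch profiler (model_0/model_1 and the dummy inputs are assumed from your code above):

import torch
from torch.profiler import profile, ProfilerActivity

# record CPU dispatch and CUDA kernel activity for both forward passes
with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out_0 = model_0(dummy_input_0)
    out_1 = model_1(dummy_input_1)
    # make sure both GPUs finished before the profile is closed
    torch.cuda.synchronize(torch.device('cuda:0'))
    torch.cuda.synchronize(torch.device('cuda:1'))

# open the trace in chrome://tracing or Perfetto and check whether the
# kernels on cuda:0 and cuda:1 overlap in time
prof.export_chrome_trace("trace.json")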

However, if your overall workload is already CPU-limited, you would of course not see any overlap.
This should already be visible while running a single model in a profiler, and Nsight Systems would show “whitespaces” between the actual kernel executions.
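
You can also get a rough feeling for this without a full profile (a sketch, assuming the setup above): time the kernel launches alone against launch plus execution.

import time
import torch

torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.no_grad():
    out_0 = model_0(dummy_input_0)  # enqueues the kernels and returns asynchronously
t1 = time.perf_counter()  # CPU-side launch time
torch.cuda.synchronize()  # wait for the GPU to finish the actual work
t2 = time.perf_counter()  # launch + execution time
print(f"launch: {(t1 - t0) * 1e3:.2f} ms, total: {(t2 - t0) * 1e3:.2f} ms")

If the launch time is close to the total time, the CPU dispatch is the bottleneck and overlapping the two models won't give you a speedup.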

If you are working with static input shapes (and meet the other requirements), you could try to use CUDA Graphs as described here to reduce the kernel launch overheads.
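
To give an idea, here is a minimal capture/replay sketch for one of the two models (assuming model_0 and dummy_input_0 from your code and a truly static input shape; check the linked docs for the full requirements):

import torch

# warm up on a side stream so lazily-created workspaces are allocated
# before the capture starts
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model_0(dummy_input_0)
torch.cuda.current_stream().wait_stream(s)

# capture one forward pass; dummy_input_0 becomes the static input buffer
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model_0(dummy_input_0)

# replay: copy new data into the static input and launch the whole
# captured graph with a single CPU-side call
dummy_input_0.copy_(torch.rand_like(dummy_input_0))  # stand-in for real data
g.replay()
result = static_out.clone()  # static_out is overwritten by every replay

After the capture the per-kernel launch overhead is gone, so even a slower CPU can keep both GPUs busy by replaying one graph per device.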