Inference multiple models simultaneously

I am trying to find a simple way to run a forward pass on a batch with two models on two GPUs at the same time. That is, I do not want to distribute a batch for a single model across devices; I want to place two models on different devices. I thought simple Python multiprocessing would work, but I am running into issues with pickling the models and memory-related problems.

So I want to do:

model1 = model1.to("cuda:0")
model2 = model2.to("cuda:1")

# run simultaneously
model1(batch)
model2(batch)

Any idea how to do this elegantly? In best case in a notebook environment.

In the simplest version, you don’t need multiprocessing at all and can take advantage of the asynchronous nature of CUDA execution.

In particular, the following runs the CUDA kernels for each model’s forward pass in parallel; only the CPU-side kernel launches remain sequential:

model1 = model1.to("cuda:0")
model2 = model2.to("cuda:1")

# run simultaneously
batch0 = batch.to("cuda:0")
batch1 = batch.to("cuda:1")
# CUDA kernels run in parallel: launches are asynchronous, and kernels on
# GPU 0 don't block kernels on GPU 1.
model1(batch0)
model2(batch1)
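Here is a minimal runnable sketch of that idea. The toy `nn.Linear` models and the random batch are placeholders for illustration, and the sketch falls back to CPU when two GPUs aren’t available:

```python
import torch
import torch.nn as nn

# Assumption for illustration: tiny Linear models stand in for your real ones.
# Fall back to CPU when the machine has fewer than two GPUs.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

model1 = nn.Linear(16, 8).to(dev0)
model2 = nn.Linear(16, 8).to(dev1)
batch = torch.randn(4, 16)

with torch.no_grad():
    # On CUDA devices these calls return as soon as the kernels are queued,
    # so the two forward passes overlap on the GPUs.
    out1 = model1(batch.to(dev0))
    out2 = model2(batch.to(dev1))

# Only synchronize when you actually need the results on the CPU side.
if two_gpus:
    torch.cuda.synchronize()
print(out1.shape, out2.shape)
```

This works fine in a notebook, since no worker processes (and hence no model pickling) are involved.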
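If the sequential CPU-side launches ever become the bottleneck (e.g. for models with many small kernels), one option is to issue the launches from two threads; PyTorch releases the GIL inside its operators, so the launch work can overlap too. A sketch under the same toy-model assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

import torch
import torch.nn as nn

# Assumption: toy models as placeholders; falls back to CPU without 2 GPUs.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

model1 = nn.Linear(16, 8).to(dev0)
model2 = nn.Linear(16, 8).to(dev1)
batch = torch.randn(4, 16)

def forward(model, x, device):
    # Each thread moves the batch to its device and runs the forward pass.
    with torch.no_grad():
        return model(x.to(device))

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(forward, model1, batch, dev0)
    f2 = pool.submit(forward, model2, batch, dev1)
    out1, out2 = f1.result(), f2.result()
```

Threads avoid the pickling problems you hit with multiprocessing, because both models stay in the same process.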