Optimizing simultaneous inference for two distinct models

I’m working with two independent autoregressive models for inference. One takes queries (sequential data) and yields an intermediate sequential output which is piped to the second model to produce the final output (which is sequential data as well).

Both these models are rather heavy, and inference takes from 1 to 10 seconds each, depending on the sequence length. To optimize query scheduling and inference time, I figured that both models could operate in a step-by-step fashion, i.e. by using RNN cells instead of RNN layers and by repeatedly calling a step() function for inference. When batching queries, this allows a new query to start as soon as another one finishes, rather than having to wait for the entire batch to complete.
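To make the scheduling idea concrete, here is a minimal pure-Python sketch of step-wise batching where a finished query's slot is refilled immediately. `ToyStepModel` and the slot bookkeeping are hypothetical stand-ins, not the real models:

```python
from collections import deque

class ToyStepModel:
    """Pure-Python stand-in for a step-wise (RNN-cell) model."""
    def step(self, token, state):
        # a real model would run an RNN cell here; we just add 1
        return token + 1, state

def run_batch(queries, model, max_slots=2):
    """Step-wise scheduling: when a query finishes, its batch slot is
    immediately refilled with the next pending query."""
    pending = deque(enumerate(queries))
    active = {}    # slot -> [query_id, token iterator, state, outputs]
    results = {}
    # fill the initial slots
    while pending and len(active) < max_slots:
        qid, q = pending.popleft()
        active[len(active)] = [qid, iter(q), None, []]
    while active:
        freed = []
        for slot, entry in active.items():
            qid, it, state, outs = entry
            token = next(it, None)
            if token is None:                 # query finished
                results[qid] = outs
                freed.append(slot)
            else:
                out, entry[2] = model.step(token, state)
                outs.append(out)
        for slot in freed:                    # refill freed slots right away
            if pending:
                qid, q = pending.popleft()
                active[slot] = [qid, iter(q), None, []]
            else:
                del active[slot]
    return [results[qid] for qid in sorted(results)]

print(run_batch([[1, 2, 3], [10], [100, 200]], ToyStepModel()))
# [[2, 3, 4], [11], [101, 201]]
```

The short query `[10]` finishes after one step, and `[100, 200]` takes over its slot without waiting for `[1, 2, 3]` to complete, which is the whole point of stepping instead of running whole sequences.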

Inference for a single query then works as follows:

query: [a_1, a_2, ..., a_N]
pipeline: a_i -> [model 1] -> b_i -> [model 2] -> c_i
output: [c_1, c_2, ..., c_N]

So I’m in a situation where I need to alternate between the two models back and forth. From an engineering perspective, it is optimal to run the first pipeline (a_i -> b_i) and the second pipeline (b_i -> c_i) in separate processes, with a shared buffer for the intermediate b_i. This is trivial when using 2 GPUs (although I haven’t implemented it yet): assuming pipeline 1 alone takes time N and pipeline 2 alone takes time M, you get a total running time of max(N, M) instead of N + M.
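As a toy illustration of that two-stage design, here is a sketch using Python threads and a `queue.Queue` as the shared buffer. The stage functions are dummy stand-ins for the real models; on real hardware each stage would run in its own process, pinned to its own GPU:

```python
import threading
import queue

SENTINEL = object()

def stage1(inputs, buf):
    # producer: a_i -> b_i (stand-in for model 1)
    for a in inputs:
        buf.put(a * 2)          # dummy computation
    buf.put(SENTINEL)           # signal end of stream

def stage2(buf, outputs):
    # consumer: b_i -> c_i (stand-in for model 2)
    while True:
        b = buf.get()
        if b is SENTINEL:
            break
        outputs.append(b + 1)   # dummy computation

inputs = [1, 2, 3, 4]
buf = queue.Queue(maxsize=8)    # shared buffer for the intermediate b_i
outputs = []
t1 = threading.Thread(target=stage1, args=(inputs, buf))
t2 = threading.Thread(target=stage2, args=(buf, outputs))
t1.start(); t2.start()
t1.join(); t2.join()
print(outputs)                  # [3, 5, 7, 9]
```

Because the FIFO queue decouples the two stages, stage 2 can already process b_1 while stage 1 is producing b_2, which is where the max(N, M) runtime comes from.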

But what if I am using a single GPU? Operating on the same GPU in a blocking fashion would amount, in the simplest case, to doing something like this:

b1 = model1.step(a1)
c1 = model2.step(b1)
b2 = model1.step(a2)
c2 = model2.step(b2)
b3 = model1.step(a3)

But then you do not leverage the independence of the two models, and you end up with an N + M runtime. Indeed, lines 2 and 3 (and likewise lines 4 and 5) could be executed in parallel on 2 GPUs: model 2 can process b1 while model 1 processes a2. Is there any way to achieve this on a single GPU? I’m aware that CUDA operations are asynchronous, but does that mean that different operations can actually run in parallel on one GPU? If yes, is this trivial to implement (i.e. does torch manage the parallelism for independent tasks automatically), or do I need to write async/threaded code?
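For concreteness, here is a sketch of what issuing the two models on separate CUDA streams might look like, since streams are the mechanism CUDA exposes for concurrent kernels on one device. Everything here is a hypothetical toy: the `Linear` layers stand in for the real step() models, the sizes are arbitrary, and whether the kernels actually overlap depends on the GPU and the kernel sizes:

```python
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    torch, HAVE_CUDA = None, False

if HAVE_CUDA:
    dev = torch.device("cuda")
    # toy stand-ins for the two step() models (hypothetical sizes)
    model1 = torch.nn.Linear(64, 64).to(dev)
    model2 = torch.nn.Linear(64, 64).to(dev)
    a_sequence = [torch.randn(1, 64, device=dev) for _ in range(8)]

    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    outputs = []
    for a in a_sequence:
        with torch.cuda.stream(s1):       # model 1's step goes to stream 1
            b = model1(a)
        ev = torch.cuda.Event()
        ev.record(s1)                     # mark b as ready on stream 1
        s2.wait_event(ev)                 # stream 2 waits for b only,
                                          # not for later stream-1 work
        with torch.cuda.stream(s2):       # model 2's step goes to stream 2;
            outputs.append(model2(b))     # it may overlap model 1's next step
    torch.cuda.synchronize()
```

Since kernel launches return immediately on the host, the loop can enqueue model 1's step for a_{i+1} on stream 1 while stream 2 is still running model 2's step for b_i; the event only orders the true b_i dependency between the streams.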

It sounds like you’re building a DAG framework from scratch. I’d suggest you take a look at something like Airflow, Metaflow, or torchserve workflows: https://github.com/pytorch/serve/tree/master/examples/Workflows

I don’t think that’s on point. A DAG pipeline would mean a graph with a fixed number of operations, but here I am going back and forth between model A and B a variable number of times.

I looked at torchserve out of the suggestions you gave, but it doesn’t seem to have specific mechanics to handle my use case, nor could I find an answer to my question about running concurrent inferences on the same CUDA device.