Running inference for 3 GPT2 models concurrently is slower than running them sequentially. How can I improve this?

Hey team,

I have an NVIDIA GPU with 24 GB of memory (g5.xlarge on AWS) and 3 differently fine-tuned GPT2 models. Unfortunately, running them concurrently is slower (45 ms for 1 token each) than running them sequentially (38 ms for 1 token each).

I expected the concurrent run to be close to 3x faster, because GPU memory is not a constraint and GPU utilization also always seems to stay below 33%.

Any ideas or packages I should try out?

Here is a minimal example (please note: below I attempt to run them concurrently using threads, but I also did the same using multiprocessing with 3 different Python scripts and got similar results):

import threading
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ARCHITECTURE = "gpt2"
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(ARCHITECTURE, use_fast=True)
model1 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
model2 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
model3 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
inputs = tokenizer(["George Washington is"], return_tensors="pt").to(DEVICE)

### RUN SEQUENTIALLY: takes 38 ms (timed with %%timeit and other methods)
output1 = model1(**inputs)
output2 = model2(**inputs)
output3 = model3(**inputs)

### RUN CONCURRENTLY WITH THREADS: takes 45 ms 
thread1 = threading.Thread(target=model1, kwargs=inputs)
thread2 = threading.Thread(target=model2, kwargs=inputs)
thread3 = threading.Thread(target=model3, kwargs=inputs)
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()

### RUN CONCURRENTLY WITH STREAMS: also takes ~45 ms
p1 = torch.cuda.Stream()
p2 = torch.cuda.Stream()
p3 = torch.cuda.Stream()
with torch.cuda.stream(p1):
    output1 = model1(**inputs)
with torch.cuda.stream(p2):
    output2 = model2(**inputs)
with torch.cuda.stream(p3):
    output3 = model3(**inputs)
p1.synchronize()
p2.synchronize()
p3.synchronize()
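
To rule out measurement artifacts from asynchronous CUDA launches, the sequential path can also be timed manually with explicit synchronization; a minimal sketch (this mirrors what %%timeit measured, with torch.cuda.synchronize() making sure all queued kernels have finished before the clock is read):

import time

torch.cuda.synchronize()
start = time.perf_counter()
output1 = model1(**inputs)
output2 = model2(**inputs)
output3 = model3(**inputs)
torch.cuda.synchronize()  # wait for all queued GPU work to finish before stopping the clock
print(f"sequential: {(time.perf_counter() - start) * 1000:.1f} ms")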

There are many related posts, but none seemed to provide a solution. I am hoping that new solutions have emerged since those were written.

Some pointers:

Multiprocessing best practices — PyTorch 2.1 documentation → multiprocessing difficult on CUDA
Multiple replicas of the model on same GPU?

Not helpful

You are most likely running into the GIL with your approach. Have you also profiled the code to see the actual execution performed on the GPU?

Thanks a lot for the pointers @ptrblck !

The above code might run into the GIL, but I also tried the same thing using multiple fully independent Python scripts (where the GIL shouldn't be an issue) and still saw the same slow performance.
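
For context, each of those independent scripts looked roughly like this (a sketch; run_one.py and the checkpoint argument are placeholder names) and was launched three times in parallel, e.g. python run_one.py ckpt1 & python run_one.py ckpt2 & python run_one.py ckpt3 &:

# run_one.py (hypothetical name): each process loads and runs exactly one model
import sys
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = sys.argv[1]  # e.g. "gpt2" or the path to one fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to("cuda")
inputs = tokenizer(["George Washington is"], return_tensors="pt").to("cuda")

output = model(**inputs)
torch.cuda.synchronize()  # make sure the forward pass has actually finished on the GPU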

I am now running profiling deep dives using:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=False,
) as prof:
    with record_function("model_inference"):
        model1(**inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
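
The same profiler run can also export a timeline that can be opened in chrome://tracing or Perfetto, to check whether the forward passes actually overlap on the GPU (a sketch; "trace.json" is just a placeholder file name):

prof.export_chrome_trace("trace.json")  # timeline view of CPU launches and GPU kernels
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))  # sort by GPU time instead of CPU time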

Regarding profiling, would you suggest other tools?