Running inference for 3 GPT2 models concurrently is slower than sequentially. How to improve?

trianxy · October 10, 2022, 5:44pm

Hey team,

I have an NVIDIA GPU with 24 GB RAM (g5.xlarge on AWS) and 3 differently fine-tuned GPT2 models. Unfortunately, running them concurrently is slower (45ms for 1 token each) than running them sequentially (38ms for 1 token each).

I expected running them concurrently to be close to 3x faster, because the size of GPU memory is not a constraint and GPU utilisation also seems to stay below 33%.
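To make that expectation concrete, here is the arithmetic as a small sketch. It assumes the 38ms and 45ms figures are totals for one forward pass through all three models (as the comments in the code below suggest) and that ideal concurrency would overlap the three passes perfectly; the variable names are only for illustration.

```python
# Timings reported in this post, in milliseconds (assumed to be totals for all 3 models).
sequential_ms = 38.0
concurrent_ms = 45.0
n_models = 3

# Under perfect overlap, all three passes would finish in the time of one pass.
ideal_concurrent_ms = sequential_ms / n_models

# Speedup actually observed; values below 1.0 mean a slowdown.
observed_speedup = sequential_ms / concurrent_ms

print(f"ideal concurrent time: {ideal_concurrent_ms:.1f} ms")
print(f"observed speedup: {observed_speedup:.2f}x")
```

So the hoped-for result is roughly 13ms, while the measured concurrent run is slower than the sequential baseline.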

Any ideas or packages I should try out?

Here is a minimal example (please note: below I attempt to run them concurrently using threads, but I also did the same using multiprocessing with 3 different Python scripts and got similar results):

import threading
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ARCHITECTURE = "gpt2"
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(ARCHITECTURE, use_fast=True)
model1 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
model2 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
model3 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
inputs = tokenizer(["George Washington is"], return_tensors="pt").to(DEVICE)

### RUN SEQUENTIALLY: takes 38ms (timed with %%timeit and other methods)
output1 = model1(**inputs)
output2 = model2(**inputs)
output3 = model3(**inputs)

### RUN CONCURRENTLY WITH THREADS: takes 45ms
thread1 = threading.Thread(target=model1, kwargs=inputs)
thread2 = threading.Thread(target=model2, kwargs=inputs)
thread3 = threading.Thread(target=model3, kwargs=inputs)
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()

### RUN CONCURRENTLY WITH STREAMS: also takes ~45ms
p1 = torch.cuda.Stream()
p2 = torch.cuda.Stream()
p3 = torch.cuda.Stream()
with torch.cuda.stream(p1):
    output1 = model1(**inputs)
with torch.cuda.stream(p2):
    output2 = model2(**inputs)
with torch.cuda.stream(p3):
    output3 = model3(**inputs)
p1.synchronize()
p2.synchronize()
p3.synchronize()
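A caveat on timings like the ones above: CUDA kernels are launched asynchronously, so a wall-clock measurement is only meaningful if the device is synchronized before the clock is read. The following is a minimal sketch of such a measurement, not the exact method used for the numbers in this post. The helper `time_forward` is a hypothetical name, and small `torch.nn.Linear` layers stand in for the GPT2 models so the snippet runs without downloading weights (it falls back to CPU when no GPU is present).

```python
import time

import torch


def time_forward(fn, warmup=3, iters=10):
    """Return the mean wall-clock time of fn() in milliseconds."""
    use_cuda = torch.cuda.is_available()
    with torch.no_grad():
        # Warm-up runs so one-time initialisation is not measured.
        for _ in range(warmup):
            fn()
        if use_cuda:
            torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        if use_cuda:
            torch.cuda.synchronize()  # wait for all launched kernels to finish
    return (time.perf_counter() - start) * 1000 / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in models (assumption: replace with model1, model2, model3 from above).
models = [torch.nn.Linear(64, 64).to(device) for _ in range(3)]
x = torch.randn(1, 64, device=device)


def run_sequential():
    return [m(x) for m in models]


sequential_ms = time_forward(run_sequential)
print(f"sequential: {sequential_ms:.3f} ms per iteration")
```

The same helper can wrap the threaded or stream-based variants, so that all three approaches are measured in the same way.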

There are many related posts, but none seems to provide a solution. I hope that new solutions have been found in the meantime.

Some pointers:

Multiprocessing best practices (PyTorch 2.1 documentation) → multiprocessing difficult on CUDA
Multiple replicas of the model on same GPU? → not helpful

ptrblck · October 11, 2022, 5:59am

You are most likely running into the GIL in your approach. Have you also profiled the code to see the actual execution performed on the GPU?
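To illustrate the GIL point in isolation, here is a minimal CPU-only sketch (no GPU or PyTorch involved; `cpu_bound`, `N`, and `JOBS` are made-up names for this illustration). On a standard CPython build, three threads running pure-Python work typically take about as long as running the same work back to back, because only one thread executes Python bytecode at a time. How much this matters for a GPU model, where the Python side mostly launches kernels, is something profiling would have to confirm.

```python
import threading
import time


def cpu_bound(n):
    """Pure-Python loop that holds the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


N = 200_000
JOBS = 3

# Sequential baseline: run the jobs one after another.
start = time.perf_counter()
sequential_results = [cpu_bound(N) for _ in range(JOBS)]
sequential_s = time.perf_counter() - start

# Threaded version: one thread per job, each storing its result by index.
threaded_results = [None] * JOBS


def worker(idx):
    threaded_results[idx] = cpu_bound(N)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(JOBS)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_s = time.perf_counter() - start

print(f"sequential: {sequential_s * 1000:.1f} ms")
print(f"threaded:   {threaded_s * 1000:.1f} ms")
```

Both versions compute identical results; only the wall-clock times are of interest, and they vary by machine and Python build.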

trianxy · October 11, 2022, 11:58am

Thanks a lot for the pointers, @ptrblck!

The above code might run into the GIL, but I also tried the same using multiple fully independent Python scripts (where the GIL shouldn’t be an issue), and still saw the same slow performance.

I am now running profiling deep-dives using

from torch.profiler import profile, record_function, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True, profile_memory=False) as prof:
    with record_function("model_inference"):
        model1(**inputs)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

Regarding profiling, would you suggest other tools?