I have an NVIDIA GPU with 24 GB of memory (a g5.xlarge on AWS) and 3 differently fine-tuned GPT-2 models. Unfortunately, running them concurrently (45 ms for 1 token each) is slower than running them sequentially (38 ms for 1 token each).
I expected running them concurrently to be close to 3x faster, because GPU memory is not a problem and GPU utilisation also seems to stay below 33%.
Any ideas or packages I should try out?
Here is a minimal example (please note: below I attempt to run them concurrently using threads, but I also did the same with multiprocessing, using 3 separate Python scripts, and got similar results):
import sys, os, threading
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
ARCHITECTURE = "gpt2"
DEVICE = "cuda"
tokenizer = AutoTokenizer.from_pretrained(ARCHITECTURE, use_fast=True)
model1 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
model2 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
model3 = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
inputs = tokenizer(["George Washington is"], return_tensors="pt").to(DEVICE)
### RUN SEQUENTIALLY: takes 38 ms (timed with %%timeit and other methods)
output1 = model1(**inputs)
output2 = model2(**inputs)
output3 = model3(**inputs)
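For what it's worth, here is a CUDA-event based timing sketch of the sequential run; it synchronises before stopping the clock, since the forward calls themselves return asynchronously:
# Timing sketch with CUDA events, so the measured interval includes the GPU work
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
output1 = model1(**inputs)
output2 = model2(**inputs)
output3 = model3(**inputs)
end.record()
torch.cuda.synchronize()
print(f"sequential: {start.elapsed_time(end):.1f} ms")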
### RUN CONCURRENTLY WITH THREADS: takes 45 ms
thread1 = threading.Thread(target=model1, kwargs=inputs)
thread2 = threading.Thread(target=model2, kwargs=inputs)
thread3 = threading.Thread(target=model3, kwargs=inputs)
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()
### RUN CONCURRENTLY WITH STREAMS: also takes ~45 ms
p1 = torch.cuda.Stream()
p2 = torch.cuda.Stream()
p3 = torch.cuda.Stream()
with torch.cuda.stream(p1):
    output1 = model1(**inputs)
with torch.cuda.stream(p2):
    output2 = model2(**inputs)
with torch.cuda.stream(p3):
    output3 = model3(**inputs)
p1.synchronize()
p2.synchronize()
p3.synchronize()
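For completeness, the two approaches can also be combined so that each thread launches its work on its own stream; a sketch of that variant (it also keeps the outputs):
# Sketch: each thread gets its own CUDA stream and writes its output into a shared list
def run_on_stream(model, stream, results, idx):
    with torch.cuda.stream(stream):
        results[idx] = model(**inputs)

streams = [torch.cuda.Stream() for _ in range(3)]
results = [None, None, None]
threads = [
    threading.Thread(target=run_on_stream, args=(m, s, results, i))
    for i, (m, s) in enumerate(zip((model1, model2, model3), streams))
]
for t in threads:
    t.start()
for t in threads:
    t.join()
torch.cuda.synchronize()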
There are many related posts, but none of them seemed to provide a solution; I am hoping that new solutions have emerged since they were written.
The above code might run into the GIL, but I also tried the same thing with multiple fully independent Python scripts (where the GIL should not be an issue) and still saw the same slow performance.
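Roughly, each of those independent scripts boils down to something like this (one model per Python process; the file name is just a placeholder):
# one_model_worker.py (hypothetical name) -- one model per process; run three of these at the same time
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ARCHITECTURE = "gpt2"
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(ARCHITECTURE, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to(DEVICE)
inputs = tokenizer(["George Washington is"], return_tensors="pt").to(DEVICE)

# time a single forward pass, synchronising so the GPU work is included
torch.cuda.synchronize()
t0 = time.perf_counter()
output = model(**inputs)
torch.cuda.synchronize()
print(f"one forward pass: {(time.perf_counter() - t0) * 1000:.1f} ms")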
I am now running profiling deep-dives using
from torch.profiler import profile, record_function, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True, profile_memory=False) as prof:
    with record_function("model_inference"):
        model1(**inputs)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
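The same profiler object can also export a Chrome trace, which gives a timeline view (openable in chrome://tracing or Perfetto) and makes it easier to see whether the kernels of the three models actually overlap:
# Export the captured profile as a timeline; the file name is arbitrary
prof.export_chrome_trace("model_inference_trace.json")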
Regarding profiling, would you suggest other tools?