Are there any reasons why running GPU-inference in a thread would be slower?

Consider this code, where I compare running a simple neural network in a thread vs. running it directly (and I’ll highlight now: the thread is started and then immediately joined):

import threading
from time import perf_counter

import torch

DEVICE = torch.device("cuda")

class Foo:
    def __init__(self, model):
        self.model = model

    def foo(self, tensor):
        with torch.inference_mode():
            self.model(tensor)

    def run(self):
        # Warmup
        for _ in range(1000):
            tensor = torch.randn(size=(1000, 1000), device=DEVICE)
            self.foo(tensor)

        start = perf_counter()
        for _ in range(1000):
            tensor = torch.randn(size=(1000, 1000), device=DEVICE)
            self.thread = threading.Thread(target=self.foo, args=(tensor,))
            self.thread.start()
            self.thread.join()
        print("Loop with threads:", perf_counter() - start)

        start = perf_counter()
        for _ in range(1000):
            tensor = torch.randn(size=(1000, 1000), device=DEVICE)
            self.foo(tensor)
        print("Simple loop:", perf_counter() - start)

model = torch.nn.Sequential(*[torch.nn.Linear(1000, 1000) for _ in range(10)]).to(DEVICE)
foo = Foo(model)
foo.run()

I actually find that both loops run for the same amount of time. Great!

But for my actual model, the threaded version is about 10x slower. One extra clue: if I run the inference on CPU, there is no relative slowdown for the threaded version.

I’d like to get some clues as to why (and if that’s not possible from this snippet, I suppose I’ll have to put together an example that’s truer to my actual model, which is not easy to do).

Btw, my ultimate goal is to run GPU inference in a background thread to populate a cache, and then consume from that cache with CPU-bound work. While the CPU work runs, GPU inference keeps going so the cache never gets depleted and I never have to wait for inference.
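Roughly, the shape I have in mind is a producer/consumer pipeline; here is a minimal sketch (reusing model and DEVICE from the snippet above; the queue size and the stand-in CPU work are just illustrative):

import queue
import threading

import torch

# Bounded queue acting as the cache of precomputed results.
cache = queue.Queue(maxsize=100)

def producer(n_items):
    # GPU inference in a background thread keeps the cache topped up.
    with torch.inference_mode():
        for _ in range(n_items):
            tensor = torch.randn(size=(1000, 1000), device=DEVICE)
            out = model(tensor)
            cache.put(out.cpu())  # blocks while the cache is full

def consumer(n_items):
    # CPU-bound work drains the cache.
    for _ in range(n_items):
        out = cache.get()
        _ = out.sum().item()  # stands in for the real CPU-bound work
        cache.task_done()

producer_thread = threading.Thread(target=producer, args=(1000,), daemon=True)
producer_thread.start()
consumer(1000)
producer_thread.join()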

Yes, this might be needed, as your current code snippet isn’t able to reproduce the issue.
In case it’s hard to narrow it down to a minimal snippet, you could try to profile your code to see which operations cause the slowdown in both approaches.
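E.g. a rough sketch with torch.profiler, reusing foo and DEVICE from your snippet (the loop length and sort key are just examples):

from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        tensor = torch.randn(size=(1000, 1000), device=DEVICE)
        foo.foo(tensor)

# Run this once for the threaded loop and once for the plain loop and compare.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))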

@ptrblck thanks for picking this up! I dug deep into my model and found that one of my submodules needed to be “warmed up”. I know of this phenomenon but have never been entirely clear on it: when you time inference, you should do some warmup runs before starting your timing loop, because the first few runs take extra time.

I discovered that these warmup effects are present within a thread as well. Even if you do something like the following (run_inference() is a placeholder for my model’s forward pass):

import threading
import time

# Attempt to warm up.
for _ in range(10):
    run_inference()

# Warmup doesn't help here!
start = time.time()
for _ in range(10):
    thread = threading.Thread(target=run_inference)
    thread.start()
    thread.join()
print(time.time() - start)

# Warmup helps here!
start = time.time()
for _ in range(10):
    run_inference()
print(time.time() - start)

You still suffer warmup effects on every iteration of the threaded loop, since each iteration spawns a fresh thread. The solution is to have one long-lived thread for everything (which you can manage with queues). I was able to verify this works.
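The pattern I ended up with looks roughly like this (a minimal sketch; run_inference here takes the input as an argument, unlike the no-arg placeholder above, and inputs stands for whatever iterable of tensors you have):

import queue
import threading

work_queue = queue.Queue()
result_queue = queue.Queue()
STOP = object()  # sentinel that tells the worker to shut down

def worker():
    # A single long-lived thread: warmup is paid once, not on every call.
    while True:
        item = work_queue.get()
        if item is STOP:
            break
        result_queue.put(run_inference(item))

thread = threading.Thread(target=worker, daemon=True)
thread.start()

# Submit work and collect results without spawning new threads.
for item in inputs:
    work_queue.put(item)
results = [result_queue.get() for _ in inputs]

work_queue.put(STOP)
thread.join()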

I’d love to get your take on this. Btw, the block of code that I know runs slower the first time around is the part where the convs are defined and where the downsample is defined.

Warmup iterations are needed for proper profiling, as you could otherwise be profiling the startup time after the device was set to IDLE and reduced its power usage.
With that being said, the actual issue is also the general use of host timers without proper synchronization. CUDA operations are executed asynchronously, so if you use host timers without synchronizing the GPU, you would measure the dispatching and kernel launches in the best case and noise in the worst case.
Since you are explicitly using multiple threads, you would also need to define what exactly you are interested in profiling (the kernel launches while the GPU is still busy, or the actual GPU execution time).
A visual profiler might help here as it would show all threads and the execution time.
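For reference, synchronizing around the host timers would look roughly like this (reusing foo and DEVICE from your first snippet; torch.cuda.Event timing would also work):

from time import perf_counter

import torch

torch.cuda.synchronize()  # make sure all previously queued work has finished
start = perf_counter()
for _ in range(1000):
    tensor = torch.randn(size=(1000, 1000), device=DEVICE)
    foo.foo(tensor)
torch.cuda.synchronize()  # wait for the queued kernels to finish before reading the timer
print("Synchronized timing:", perf_counter() - start)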
