torchaudio.functional.lfilter runs very slowly on GPU, but fast on CPU

As the title says, lfilter takes about 1 ms to run on the CPU, but on the GPU it is roughly 1000x slower! I posted my code and test results below. Could anyone tell me why, and how to speed it up on the GPU?

import torch
import torchaudio

device = "cuda"
waveform = torch.rand([16000]).to(device)
a = torch.tensor([1.0, 0.0]).to(device)
b = torch.tensor([1.0, -0.97]).to(device)
for i in range(5):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    waveform = torchaudio.functional.lfilter(waveform, a, b)
    end.record() 
    torch.cuda.synchronize()
    print("run on %s cost %.3f ms" % (device, start.elapsed_time(end)))

Test result:

run on cpu cost 1.473 ms
run on cpu cost 1.531 ms
run on cpu cost 0.905 ms
run on cpu cost 0.774 ms
run on cpu cost 1.007 ms
run on cuda cost 961.567 ms
run on cuda cost 955.971 ms
run on cuda cost 962.749 ms
run on cuda cost 957.605 ms
run on cuda cost 965.437 ms

My env is a V100 GPU, CUDA 11, torch '1.10.1+cu113' and torchaudio '0.10.1+cu113'.

Thank you very much!

If I'm not mistaken, this code path will be used, which launches a lot of GPU kernels (nsys shows 160045 cudaLaunchKernel calls) and thus creates a lot of CPU-side overhead.
To reduce the overhead caused by the kernel launches, you could try CUDA graphs as described here and see if they match your use case.
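The sequential nature is the root of the problem: lfilter evaluates a recursive difference equation in which each output sample depends on the previous one, so the time loop cannot be parallelized across samples and each step issues its own small kernels. A minimal pure-Python sketch of the first-order recurrence (assuming a normalized a[0] == 1, matching the coefficients in the question; the helper name is my own):

```python
def lfilter_1st_order(x, a, b):
    """Direct-form first-order IIR: y[n] = b[0]*x[n] + b[1]*x[n-1] - a[1]*y[n-1]."""
    y = []
    x_prev = 0.0  # x[-1] treated as zero (initial rest)
    y_prev = 0.0  # y[-1] treated as zero
    for x_n in x:
        # each step needs y_prev from the previous step -> inherently serial
        y_n = b[0] * x_n + b[1] * x_prev - a[1] * y_prev
        x_prev, y_prev = x_n, y_n
        y.append(y_n)
    return y

# pre-emphasis coefficients from the question: a = [1, 0], b = [1, -0.97]
print(lfilter_1st_order([1.0, 1.0, 1.0], [1.0, 0.0], [1.0, -0.97]))
```

On the GPU this serial dependency turns into one (or several) kernel launches per time step, which is where the 160045 launches come from.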

Example:

import torch
import torchaudio
import time

device = 'cuda'
nb_iters = 100

# Placeholder input used for capture
waveform = torch.rand([16000]).to(device)
a = torch.tensor([1.0, 0.0]).to(device)
b = torch.tensor([1.0, -0.97]).to(device)

# warmup
for _ in range(10):
    waveform = torchaudio.functional.lfilter(waveform, a, b)

# profile
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    waveform = torchaudio.functional.lfilter(waveform, a, b)
torch.cuda.synchronize()
t1 = time.perf_counter()

print('Eager, {}s/iter'.format((t1 - t0)/nb_iters))

# CUDA graphs

g = torch.cuda.CUDAGraph()

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        waveform = torchaudio.functional.lfilter(waveform, a, b)
torch.cuda.current_stream().wait_stream(s)

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
with torch.cuda.graph(g):
    waveform = torchaudio.functional.lfilter(waveform, a, b)

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    g.replay()
torch.cuda.synchronize()
t1 = time.perf_counter()

print('CUDA graphs, {}s/iter'.format((t1 - t0)/nb_iters))

Output:

Eager, 0.6360926309000934s/iter
CUDA graphs, 0.07975950390042272s/iter
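As a side note (an observation from the coefficients in the question, not part of the answer above): with a = [1.0, 0.0] the filter has no feedback term, so this particular pre-emphasis reduces to y[n] = x[n] - 0.97*x[n-1] and can be written as a single vectorized subtraction with no per-sample recursion, which maps well to the GPU. A sketch (the preemphasis helper is my own name, not a torchaudio API):

```python
import torch

def preemphasis(waveform, coeff=0.97):
    # y[n] = x[n] - coeff * x[n-1], treating x[-1] as 0 (initial rest),
    # matching lfilter with b = [1, -coeff] and a = [1, 0]
    out = waveform.clone()
    out[1:] -= coeff * waveform[:-1]
    return out

x = torch.rand(16000)  # move to 'cuda' if available; this is one fused elementwise op there
y = preemphasis(x)
```

For filters with genuine feedback (a[1] != 0) this shortcut does not apply and the CUDA-graph approach above is the way to go.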