As the title says, lfilter on CPU takes about 1 ms to run, but on the GPU it is ~1000x slower! I posted my code and test results below. Could anyone tell me why, and how to speed it up on the GPU?
```python
device = "cuda"
waveform = torch.rand([16000]).to(device)
a = torch.tensor([1.0, 0.0]).to(device)
b = torch.tensor([1.0, -0.97]).to(device)

for i in range(5):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    waveform = torchaudio.functional.lfilter(waveform, a, b)
    end.record()
    torch.cuda.synchronize()
    print("run on %s cost %.3f ms" % (device, start.elapsed_time(end)))
```
Test result:

```
run on cpu cost 1.473 ms
run on cpu cost 1.531 ms
run on cpu cost 0.905 ms
run on cpu cost 0.774 ms
run on cpu cost 1.007 ms
run on cuda cost 961.567 ms
run on cuda cost 955.971 ms
run on cuda cost 962.749 ms
run on cuda cost 957.605 ms
run on cuda cost 965.437 ms
```
My env is a V100 GPU, CUDA 11, torch `1.10.1+cu113` and torchaudio `0.10.1+cu113`.
If I'm not mistaken, this code will be used, which launches a lot of GPU kernels (nsys shows 160045 cudaLaunchKernel calls) and should therefore create a lot of CPU overhead.
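For intuition on why so many kernels are launched: a linear filter evaluated sample by sample has a sequential dependency along time, so an op-by-op GPU implementation ends up launching kernels per time step. A minimal pure-Python sketch of the difference equation (`lfilter_ref` is a hypothetical reference for illustration, not torchaudio's actual implementation):

```python
def lfilter_ref(a, b, x):
    # Direct-form difference equation:
    #     a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]
    # y[n] depends on y[n-1], so the loop over n is inherently
    # sequential -- this is what forces per-step GPU work.
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

# Coefficients from the post: a=[1, 0] (no feedback), b=[1, -0.97] (pre-emphasis)
print(lfilter_ref([1.0, 0.0], [1.0, -0.97], [1.0, 0.0, 0.0]))
# impulse response: [1.0, -0.97, 0.0]
```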
To reduce the overhead caused by the kernel launches, you could try using CUDA graphs as described here and see if they match your use case.
Example:
```python
import torch
import torchaudio
import time

device = 'cuda'
nb_iters = 100

# Placeholder input used for capture
waveform = torch.rand([16000]).to(device)
a = torch.tensor([1.0, 0.0]).to(device)
b = torch.tensor([1.0, -0.97]).to(device)

# warmup
for _ in range(10):
    waveform = torchaudio.functional.lfilter(waveform, a, b)

# profile eager mode
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    waveform = torchaudio.functional.lfilter(waveform, a, b)
torch.cuda.synchronize()
t1 = time.perf_counter()
print('Eager, {}s/iter'.format((t1 - t0) / nb_iters))

# CUDA graphs
g = torch.cuda.CUDAGraph()

# Warmup on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        waveform = torchaudio.functional.lfilter(waveform, a, b)
torch.cuda.current_stream().wait_stream(s)

# Capture the graph
# (to allow capture, torch.cuda.graph automatically sets a side stream
#  as the current stream inside the context)
with torch.cuda.graph(g):
    waveform = torchaudio.functional.lfilter(waveform, a, b)

# profile graph replay
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    g.replay()
torch.cuda.synchronize()
t1 = time.perf_counter()
print('CUDA graphs, {}s/iter'.format((t1 - t0) / nb_iters))
```
Output:

```
Eager, 0.6360926309000934s/iter
CUDA graphs, 0.07975950390042272s/iter
```
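As a rough sanity check on the launch-overhead explanation (the ~6 µs per-launch cost below is an assumed typical figure, not something measured here):

```python
launches = 160045  # cudaLaunchKernel calls reported by nsys
overhead_us = 6.0  # assumed CPU cost per kernel launch (typical order of magnitude)

# If those launches correspond to a single lfilter call, launch overhead
# alone would account for roughly the measured ~960 ms CUDA time:
total_ms = launches * overhead_us / 1000
print("%.0f ms" % total_ms)  # prints "960 ms"
```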