# Inference speed problem

Hello.

I’m developing a simple Python application for personal use that detects the fundamental frequencies of two harmonic sounds simultaneously. To achieve this, I’ve created and trained a small PyTorch neural network. I’m going to use it on a machine without a GPU, so the code below is executed on the CPU.

``````
Network(
  (conv1): DepthwiseConvBlock(
    (depth_conv): Conv1d(1, 16, kernel_size=(11,), stride=(1,), padding=same)
    (point_conv): Conv1d(16, 16, kernel_size=(1,), stride=(1,))
    (nonlin): GELU(approximate='none')
    (norm): LayerNorm((48,), eps=1e-05, elementwise_affine=True)
    (pool): MaxPool2d(kernel_size=[1, 2], stride=[1, 2], padding=0, dilation=1, ceil_mode=False)
  )
  (conv2): DepthwiseConvBlock(
    (depth_conv): Conv1d(16, 16, kernel_size=(7,), stride=(1,), padding=same, groups=16)
    (point_conv): Conv1d(16, 16, kernel_size=(1,), stride=(1,))
    (nonlin): GELU(approximate='none')
    (norm): LayerNorm((24,), eps=1e-05, elementwise_affine=True)
    (pool): MaxPool2d(kernel_size=[1, 2], stride=[1, 2], padding=0, dilation=1, ceil_mode=False)
  )
  (conv3): DepthwiseConvBlock(
    (depth_conv): Conv1d(16, 8, kernel_size=(5,), stride=(1,), padding=same, groups=8)
    (point_conv): Conv1d(8, 8, kernel_size=(1,), stride=(1,))
    (nonlin): GELU(approximate='none')
    (norm): LayerNorm((12,), eps=1e-05, elementwise_affine=True)
    (pool): MaxPool2d(kernel_size=[1, 2], stride=[1, 2], padding=0, dilation=1, ceil_mode=False)
  )
  (flat): Flatten(start_dim=1, end_dim=-1)
  (linear): Linear(in_features=96, out_features=8, bias=True)
)
``````
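For context, each conv block is organized roughly like this. This is a hypothetical reconstruction from the printed repr only — the real code in TheModel.py may differ, and the constructor argument names are my own:

```python
import torch
from torch import nn

class DepthwiseConvBlock(nn.Module):
    """Hypothetical reconstruction of one block from the printed repr;
    the actual implementation in TheModel.py may differ."""
    def __init__(self, in_ch, out_ch, kernel, length, groups=1):
        super().__init__()
        # grouped ("depthwise") conv followed by a 1x1 pointwise conv
        self.depth_conv = nn.Conv1d(in_ch, out_ch, kernel,
                                    padding="same", groups=groups)
        self.point_conv = nn.Conv1d(out_ch, out_ch, 1)
        self.nonlin = nn.GELU()
        self.norm = nn.LayerNorm(length)   # normalizes over the time axis
        self.pool = nn.MaxPool2d([1, 2])   # halves the time axis

    def forward(self, x):
        x = self.point_conv(self.depth_conv(x))
        x = self.norm(self.nonlin(x))
        return self.pool(x)

# e.g. the first block: 1 -> 16 channels, kernel 11, time length 48
block = DepthwiseConvBlock(1, 16, 11, 48)
out = block(torch.randn(1, 1, 48))   # -> shape (1, 16, 24)
```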

The input to the network is a spectrum calculated by a CQT transform function that operates on NumPy audio arrays. The network, in turn, produces probabilities for the detected notes. However, there’s an issue: when I run the network in a loop alongside the CQT transform function, the network’s inference speed drops dramatically.

Just the CQT transform function:

``````
def profile_cqt_to_torch():
    cqts = []
    for a in audio:
        cqt = cqt_rough(a)
        cqts.append(torch.from_numpy(cqt).unsqueeze(0).to(torch.float))
    return cqts

t = time()
cqts = profile_cqt_to_torch()
print(time() - t)

>>> 0.35244059562683105
``````

Just the network with precalculated CQTs:

``````
network.eval()

def profile_net():
    preds = []
    for cqt in cqts:
        pred = network(cqt)
        preds.append(pred)
    return preds

t = time()
preds = profile_net()
print(time() - t)

>>> [W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.
>>> 0.5710155963897705
``````

Combined:

``````
network.eval()

def profile_cqt_net():
    preds = []
    for a in audio:
        cqt = cqt_rough(a)
        pred = network(torch.from_numpy(cqt).unsqueeze(0).to(torch.float))
        preds.append(pred)
    return preds

t = time()
preds = profile_cqt_net()
print(time() - t)

>>> 15.845890522003174
``````

By using the new `torch.compile` method with `mode="reduce-overhead"`, I’ve managed to decrease the execution time to approximately 10 seconds. Nevertheless, this still seems excessively long.
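Roughly, this is what my compile call looks like (the placeholder `nn.Linear` here stands in for the real network, which I haven’t included):

```python
import torch
from torch import nn

# Sketch of the torch.compile call described above; "net" is a placeholder
# module with the same input/output sizes as the real network.
net = nn.Linear(96, 8)
net_c = torch.compile(net, mode="reduce-overhead")
# Compilation itself happens lazily, on the first forward pass.
```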

`cProfile` and `torch.profiler` give similar results. I check the `dtype` and `size` of the input tensor on every iteration.

Profiling with precalculated CQTs:
Code:

``````
def profile_net():
    preds = []
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        for cqt in cqts:
            assert cqt.shape == torch.Size([1, 96])
            assert cqt.dtype == torch.float32

            with record_function("model_inference"):
                pred = network(cqt)
            preds.append(pred)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
    return preds
``````

cProfile results:

``````
92544 function calls (72544 primitive calls) in 0.850 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2    0.000    0.000    0.850    0.425 /home/user/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3406(run_code)
2    0.000    0.000    0.850    0.425 {built-in method builtins.exec}
1    0.000    0.000    0.849    0.849 <ipython-input-24-33789e8bbc84>:1(<module>)
1    0.007    0.007    0.849    0.849 <ipython-input-23-24b737ab1132>:6(profile_net)
10500/500    0.023    0.000    0.828    0.002 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1514(_wrapped_call_impl)
10500/500    0.044    0.000    0.827    0.002 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1520(_call_impl)
500    0.020    0.000    0.823    0.002 /mnt/3660917E7771EA5C/Programming/PitchDetect/TheModel.py:71(forward)
1500    0.049    0.000    0.720    0.000 /mnt/3660917E7771EA5C/Programming/PitchDetect/TheModel.py:49(forward)
3000    0.016    0.000    0.378    0.000 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py:309(forward)
3000    0.008    0.000    0.356    0.000 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py:301(_conv_forward)
3000    0.348    0.000    0.348    0.000 {built-in method torch.conv1d}
1500    0.003    0.000    0.087    0.000 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/activation.py:681(forward)
1500    0.084    0.000    0.084    0.000 {built-in method torch._C._nn.gelu}
1500    0.010    0.000    0.068    0.000 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/normalization.py:195(forward)
1500    0.006    0.000    0.060    0.000 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/pooling.py:165(forward)
1500    0.008    0.000    0.055    0.000 /home/user/.local/lib/python3.10/site-packages/torch/nn/functional.py:2528(layer_norm)
1500    0.005    0.000    0.055    0.000 /home/user/.local/lib/python3.10/site-packages/torch/_jit_internal.py:478(fn)
1500    0.004    0.000    0.050    0.000 /home/user/.local/lib/python3.10/site-packages/torch/nn/functional.py:769(_max_pool2d)
1500    0.045    0.000    0.045    0.000 {built-in method torch.max_pool2d}
1500    0.044    0.000    0.044    0.000 {built-in method torch.layer_norm}
``````

`torch.profiler` results:

``````
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
model_inference        33.48%     264.789ms        99.68%     788.410ms       1.577ms           500
aten::conv1d         2.02%      16.011ms        43.88%     347.084ms     115.695us          3000
aten::convolution         1.94%      15.362ms        41.53%     328.489ms     109.496us          3000
aten::_convolution         5.33%      42.175ms        39.61%     313.289ms     104.430us          3000
aten::_convolution_mode         1.38%      10.923ms        29.09%     230.100ms     153.400us          1500
aten::mkldnn_convolution        18.30%     144.708ms        19.21%     151.955ms     151.955us          1000
aten::thnn_conv2d         0.71%       5.634ms        10.87%      85.951ms      42.975us          2000
aten::_slow_conv2d_forward         6.81%      53.884ms        10.21%      80.788ms      40.394us          2000
aten::gelu         6.63%      52.461ms         6.63%      52.461ms      34.974us          1500
aten::layer_norm         0.64%       5.027ms         5.72%      45.237ms      30.158us          1500
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 790.968ms
``````

With both functions combined:
Code:

``````
def profile_cqt_net():
    preds = []
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        for a in audio:
            cqt = cqt_rough(a)
            cqt = torch.from_numpy(cqt).unsqueeze(0).to(torch.float)

            assert cqt.shape == torch.Size([1, 96])
            assert cqt.dtype == torch.float32

            with record_function("model_inference"):
                pred = network(cqt)
            preds.append(pred)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
    return preds
``````
``````
109039 function calls (107539 primitive calls) in 11.783 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2    0.000    0.000   11.783    5.891 /home/user/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3406(run_code)
2    0.000    0.000   11.783    5.891 {built-in method builtins.exec}
1    0.001    0.001   11.782   11.782 <ipython-input-16-b93e40bc29de>:1(<module>)
1    0.027    0.027   11.782   11.782 <ipython-input-13-9f753a9ee13c>:7(profile_cqt_net)
1000/500    0.005    0.000   10.409    0.021 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1514(_wrapped_call_impl)
1000/500    0.015    0.000   10.407    0.021 /home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1520(_call_impl)
1000/500    0.020    0.000   10.397    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:307(_fn)
500    0.003    0.000   10.333    0.021 /mnt/3660917E7771EA5C/Programming/PitchDetect/TheModel.py:71(forward)
500    0.002    0.000   10.311    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_dynamo/external_utils.py:15(inner)
500    0.003    0.000   10.308    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:3901(forward)
500    0.002    0.000   10.305    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:1481(g)
500    0.010    0.000   10.303    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:2519(runtime_wrapper)
500    0.007    0.000   10.292    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:1498(call_func_with_args)
500    0.002    0.000   10.282    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:1583(rng_functionalization_wrapper)
500    0.002    0.000   10.280    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_inductor/codecache.py:373(__call__)
500    0.002    0.000   10.277    0.021 /home/user/.local/lib/python3.10/site-packages/torch/_inductor/codecache.py:385(_run_from_cache)
500    0.169    0.000   10.275    0.021 /tmp/torchinductor_user/fy/cfyr3q7akk2fssiuzxfhjdricxsnhqemn66ccqvr6xnodgbjmeai.py:297(call)
3000    9.927    0.003    9.927    0.003 {built-in method torch.convolution}
500    0.010    0.000    1.275    0.003 /mnt/3660917E7771EA5C/Programming/PitchDetect/cqt_rough.py:77(__call__)
1500    1.154    0.001    1.242    0.001 /mnt/3660917E7771EA5C/Programming/PitchDetect/cqt_rough.py:74(apply_filters)
1500    0.006    0.000    0.088    0.000 {method 'mean' of 'numpy.ndarray' objects}
1500    0.033    0.000    0.082    0.000 /home/user/.local/lib/python3.10/site-packages/numpy/core/_methods.py:163(_mean)
2500    0.010    0.000    0.069    0.000 /home/user/.local/lib/python3.10/site-packages/torch/_ops.py:687(__call__)
2500    0.059    0.000    0.059    0.000 {built-in method torch._ops.inductor._reinterpret_tensor}
3500    0.054    0.000    0.054    0.000 {built-in method torch.empty_strided}
500    0.036    0.000    0.036    0.000 {built-in method torch.addmm}
1500    0.024    0.000    0.024    0.000 {method 'reduce' of 'numpy.ufunc' objects}
``````
``````
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
model_inference         2.63%     643.239ms        99.60%       24.327s      48.653ms           500
aten::conv1d         0.38%      92.808ms        40.14%        9.805s       3.268ms          3000
aten::convolution         0.18%      42.919ms        39.91%        9.748s       3.249ms          3000
aten::_convolution         0.41%     100.469ms        39.73%        9.705s       3.235ms          3000
aten::_convolution_mode         0.11%      27.077ms        38.94%        9.512s       6.341ms          1500
aten::mkldnn_convolution        29.10%        7.108s        29.22%        7.137s       7.137ms          1000
aten::gelu        28.33%        6.920s        28.33%        6.920s       4.614ms          1500
aten::layer_norm         0.07%      17.180ms        27.44%        6.701s       4.467ms          1500
aten::native_layer_norm        27.26%        6.658s        27.36%        6.684s       4.456ms          1500
aten::thnn_conv2d         0.08%      19.653ms         9.72%        2.375s       1.187ms          2000
aten::_slow_conv2d_forward         9.33%        2.280s         9.67%        2.363s       1.181ms          2000
aten::unsqueeze         0.33%      79.823ms         0.41%     100.883ms      13.451us          7500
aten::linear         0.08%      20.467ms         0.34%      82.141ms     164.282us           500
aten::max_pool2d         0.08%      18.410ms         0.29%      71.995ms      47.997us          1500
aten::to         0.04%       8.668ms         0.22%      54.208ms     108.416us           500
aten::copy_         0.22%      54.167ms         0.22%      54.167ms      18.056us          3000
aten::max_pool2d_with_indices         0.22%      53.585ms         0.22%      53.585ms      35.723us          1500
aten::_to_copy         0.11%      25.727ms         0.19%      46.555ms      93.110us           500
aten::view         0.17%      42.428ms         0.17%      42.428ms       4.243us         10000
aten::addmm         0.11%      27.823ms         0.17%      41.023ms      82.046us           500
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 24.425s
``````

OS: Ubuntu 22.04
PyTorch: 2.1.0
I have no clue how to fix this problem and am hoping for help.