How to reduce the overhead of conv layers?

I'm trying to optimize inference latency, but I found a strange overhead in a single conv layer. The latency still fits a y = kx + b relationship (with x being the number of output channels), but b is unacceptably large. How can I reduce b? Thanks in advance!

The profiling results are as follows (latency is the total over the 50 timed iterations, in seconds):

[filter_h, filter_w, input_channels, output_channels] Latency (s)
[3, 3, 16, 8] 0.024658
[3, 3, 16, 16] 0.032011
[3, 3, 16, 32] 0.031948
[3, 3, 16, 64] 0.037025
[3, 3, 16, 128] 0.049538
[3, 3, 16, 256] 0.062251
[3, 3, 16, 512] 0.105888
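To put a number on b, here is a rough least-squares fit of y = kx + b over the measurements above, taking x as the number of output channels. This is only a sketch: the helper name fit_line and the variable names are made up for illustration, and a straight line is just an approximation of these seven points.

```python
# Latencies copied from the table above: (output_channels, latency in seconds).
measurements = [
    (8, 0.024658),
    (16, 0.032011),
    (32, 0.031948),
    (64, 0.037025),
    (128, 0.049538),
    (256, 0.062251),
    (512, 0.105888),
]


def fit_line(points):
    """Ordinary least-squares fit of y = k*x + b; returns (k, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is covariance(x, y) divided by variance(x).
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    k = s_xy / s_xx
    # The fitted line passes through the point of means.
    b = mean_y - k * mean_x
    return k, b


k, b = fit_line(measurements)
print(f"k = {k:.3e} s per output channel, b = {b:.4f} s")
```

The intercept b is the part of the latency that does not scale with the number of output channels, which is the overhead I am asking about.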

The code for profiling goes as follows:

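import datetime

import numpy as np
import torch
import torch.nn.functional as F

for i in range(7):
  # [filter_h, filter_w, input_channels, output_channels]
  shape = [3, 3, 16, 2 ** (i + 3)]
  kernel_value = np.random.rand(*shape).astype(np.float32)
  # F.conv2d expects weights as [out_channels, in_channels, filter_h, filter_w]
  kernel = torch.as_tensor(np.transpose(kernel_value, (3, 2, 0, 1)))
  input_value = np.random.rand(1, 16, 32, 32).astype(np.float32)
  x = torch.as_tensor(input_value)

  before = datetime.datetime.now()
  for j in range(100):
    if j == 50:
      # The first 50 iterations are warm-up; restart the timer here
      before = datetime.datetime.now()
    tmp = F.conv2d(x, weight=kernel, bias=None, stride=1, padding=(3 - 1) // 2)
  after = datetime.datetime.now()
  interval = after - before
  # Total time of the 50 timed iterations, in seconds
  print(str(shape) + "\t" + str(interval.total_seconds()))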
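The same warm-up-then-measure pattern can also be written as a small reusable helper that reports the average time per call rather than the total. This is a minimal sketch using only the standard library; the name time_per_call and its parameters are hypothetical, and the monotonic time.perf_counter clock is swapped in for datetime.

```python
import time


def time_per_call(fn, warmup=50, iters=50):
    """Call fn() warmup times untimed, then return the mean seconds per call
    over iters timed calls."""
    for _ in range(warmup):
        fn()  # warm-up calls, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters
```

With the tensors from the loop above, it could be used as time_per_call(lambda: F.conv2d(x, weight=kernel, bias=None, stride=1, padding=1)). Note that it returns a per-call average, so the result would be roughly 1/50 of the totals in the table.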

I just tried it on some other machines (CPU only). The problem appears on them too.