How to construct the most efficient convolution for a large kernel_size

When testing the computational efficiency of convolution with different kernel sizes, I found that for kernel sizes 3, 5, and 7 the computation is very fast, but for kernel sizes greater than 7 it slows down considerably. In addition, for kernel sizes 31, 39, 45, 55, 75, 93, and 101 the efficiency is higher than for their adjacent sizes. What is the reason?

import numpy as np
import torch
from torch import nn
import time

h = 6115
w = 5490
pic = np.random.randint(0, 2, (h, w), dtype=np.int16)
Max_iters = 5
for kernel_size in range(3, 110, 2):

    # all-ones kernel, so the convolution simply sums each pixel's neighborhood
    kernel = torch.ones((1, 1, kernel_size, kernel_size))
    neighbor_func = nn.Conv2d(1, 1, kernel_size, padding=(kernel_size - 1) // 2, bias=False)

    # overwrite the randomly initialized weight with the all-ones kernel
    with torch.no_grad():
        neighbor_func.weight.copy_(kernel)

    tst = time.time()
    for i in range(Max_iters):
        N = np.where(pic == 1, 1, 0)
        N = np.reshape(N, (1, 1, h, w))
        N = torch.from_numpy(N).type(torch.float32)
        with torch.no_grad():
            N = neighbor_func(N)
        N = N.detach().numpy()
        N = np.squeeze(N)
    print(kernel_size, time.time() - tst)

I want to use nn.Conv2d purely as a convolution tool: it should only perform the convolution itself, with no gradient computation or backpropagation. Which function should I choose to get the highest-performance convolution for large kernel sizes?
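For context, this is roughly the kind of call I have in mind: a minimal sketch using torch.nn.functional.conv2d under torch.inference_mode() (available in recent PyTorch versions), so no autograd state is created at all. The shapes and kernel size below are just placeholders:

import torch
import torch.nn.functional as F

kernel_size = 31
# all-ones kernel, shape (out_channels, in_channels, kH, kW)
weight = torch.ones((1, 1, kernel_size, kernel_size))

x = torch.rand((1, 1, 6115, 5490))        # stand-in for the binarized image
with torch.inference_mode():              # no autograd bookkeeping
    y = F.conv2d(x, weight, bias=None, padding=(kernel_size - 1) // 2)
print(y.shape)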

If you are on CPU, a guess would be that these sizes fit nicely into the shapes that the kernels in the library are well-optimized for. Additionally, you might consider packing the tensor in mkldnn layout (see torch.Tensor.to_mkldnn in the PyTorch 1.11.0 documentation) and seeing if you get a different pattern for the “most efficient” kernel sizes.
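A minimal sketch of what that packing could look like, assuming torch.utils.mkldnn.to_mkldnn is available in your build (it converts supported modules such as nn.Conv2d so that the weights are packed as well; the shapes here are just placeholders):

import torch
from torch import nn
from torch.utils import mkldnn as mkldnn_utils

conv = nn.Conv2d(1, 1, 31, padding=15, bias=False).eval()
conv_mkldnn = mkldnn_utils.to_mkldnn(conv)   # pack the weights in mkldnn layout

x = torch.rand((1, 1, 6115, 5490))
with torch.no_grad():
    y = conv_mkldnn(x.to_mkldnn())           # input also packed in mkldnn layout
y = y.to_dense()                             # back to the regular (strided) layout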

Thank you for your advice! I tried to use mkldnn, but I am not sure whether I am using it the right way.

import numpy as np
import torch
from torch import nn
import time

h = 6115
w = 5490
pic = np.random.randint(0, 2, (h, w), dtype=np.int16)
Max_iters = 5
for kernel_size in range(3, 109, 2):

    kernel = torch.ones((1, 1, kernel_size, kernel_size))
    neighbor_func = nn.Conv2d(1, 1, kernel_size, padding=(kernel_size - 1) // 2, bias=False)

    with torch.no_grad():
        neighbor_func.weight.copy_(kernel)

    tst = time.time()
    for i in range(Max_iters):
        N = np.where(pic == 1, 1, 0)
        N = np.reshape(N, (1, 1, h, w))
        # pack the input into mkldnn layout before the convolution
        N = torch.from_numpy(N).type(torch.float32).to_mkldnn()
        with torch.no_grad():
            N = neighbor_func(N)
        # convert back to a regular (strided) tensor before going to numpy
        N = N.detach().to_dense().numpy()
        N = np.squeeze(N)
    print(kernel_size, time.time() - tst)

I modified the code but could not get any performance improvement.

I’m a bit confused about how the code is written; why does the numpy conversion need to happen in the hot loop here? That might add a lot of noise/overhead to the part of the computation that is relevant for benchmarking.
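For example, a rough sketch of what I mean (reusing the variable names from your code), where the image is converted to a float tensor once, outside the timed loop, so only the convolution itself is measured:

import numpy as np
import torch
from torch import nn
import time

h, w = 6115, 5490
pic = np.random.randint(0, 2, (h, w), dtype=np.int16)
Max_iters = 5
kernel_size = 31

neighbor_func = nn.Conv2d(1, 1, kernel_size, padding=(kernel_size - 1) // 2, bias=False)
with torch.no_grad():
    neighbor_func.weight.fill_(1.0)        # all-ones kernel, as in the original code

# convert the image to a float tensor once, outside the timed region
x = torch.from_numpy(np.where(pic == 1, 1, 0).reshape(1, 1, h, w)).type(torch.float32)

tst = time.time()
with torch.no_grad():
    for _ in range(Max_iters):
        out = neighbor_func(x)             # only the convolution is timed
print(kernel_size, time.time() - tst)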

I would also check if MKLDNN is installed and being used:
(e.g., BKMs to check whether mkl or mkldnn is enabled on PyTorch · GitHub)
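For instance, a quick check along those lines using only public PyTorch APIs:

import torch

print(torch.backends.mkldnn.is_available())   # True if the MKL-DNN/oneDNN backend is built in
print(torch.backends.mkl.is_available())      # True if MKL is built in
print(torch.__config__.show())                # full build config, lists MKL / MKL-DNN versions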

This is part of my code, and the numpy array is used in other parts. I tested the overhead of the type conversion: before using mkldnn it was about 0.03 s per iteration, and with mkldnn it was about 0.09 s per iteration, which is negligible compared with the computational cost of the convolution. I used torch.backends.mkldnn.is_available() and torch.backends.mkl.is_available(), and both returned True.