Performance on small conv

I called torch.set_num_threads(8) to use all of my CPU cores.
The results show that for a big conv, running on 8 threads is clearly faster than on a single thread. But for a small conv, there is no big difference in speed between 8 threads and a single thread.

Please check this table of results (conv_perf).

MKL-DNN is enabled, and below is the verbose output from a run:

mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw8c,num:1,1x32x1x107718,2.18091
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw8i8o,num:1,32x32x1x1,0.00195312
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:x fdst:nChw8c,alg:convolution_direct,mb1_ic32oc32_ih1oh1kh1sh1dh0ph0_iw107718ow107718kw1sw1dw0pw0,3.87793
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw8c out:f32_nchw,num:1,1x32x1x107718,4.69116
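
(Side note, in case anyone wants to reproduce this log: as far as I know the output above is controlled by the MKLDNN_VERBOSE environment variable, which can be set before torch is imported, e.g.:)

import os
os.environ["MKLDNN_VERBOSE"] = "1"  # ask mkl-dnn to log every primitive it executes
import torch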

So why is 8 threads not obviously faster than 1 thread for the small conv? And how can I improve the speed of the small conv?

Thanks a lot!

My code

import time
import torch
import torch.nn as nn

class MyConv(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.cell = nn.Conv1d(*args, **kwargs)
        self.cell.weight.data.normal_(0.0, 0.02)

    def forward(self, x):
        return self.cell(x)


def main():
    #print(*torch.__config__.show().split("\n"), sep="\n")
    torch.set_num_threads(1)  # switch between 1 and 8 to compare single- vs multi-threaded runs
    dim = 32
    kernels = 3
    seq = 100000
    MyCell = MyConv(dim, dim, kernel_size=kernels, stride=1)
    MyCell.eval()
    inputs = []
    iter = 1000
    for i in range(iter):
        inputs.append(torch.rand(1, dim, seq))

    start = time.time() * 1000  # wall-clock time in milliseconds
    for i in range(iter):
        print(i)
        y = MyCell(inputs[i])
        #print(y)
    end = time.time() * 1000
    print('cost %d ms per iter\n' % ((end - start) / iter))


if __name__ == "__main__":
    main()

Hi,
I am not experienced in this area and just wanted to discuss it.

Intuitively, I would say that when you enable multi-threading you add extra costs, such as context switching between threads. So for an operation on a small tensor, those extra costs of multi-threading cancel out the speedup. The same thing applies to GPU vs. CPU:
for instance, operations on small tensors run at almost the same speed on GPU as on CPU, but as the tensor size grows the gap between the two increases enormously.
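
A rough way to see this is to time the same conv at a couple of sizes with 1 and 8 threads (just a sketch; the sizes and iteration count here are arbitrary):

import time
import torch

def bench_ms(threads, seq, iters=100, dim=32):
    # note: on some builds set_num_threads only takes effect before the first
    # parallel op, in which case run the script once per thread count instead
    torch.set_num_threads(threads)
    conv = torch.nn.Conv1d(dim, dim, kernel_size=3).eval()
    x = torch.rand(1, dim, seq)
    with torch.no_grad():
        conv(x)  # warm-up
        start = time.time()
        for _ in range(iters):
            conv(x)
    return (time.time() - start) * 1000 / iters

for seq in (1000, 100000):  # "small" vs "big" along the sequence dimension
    for t in (1, 8):
        print('seq=%d threads=%d: %.3f ms' % (seq, t, bench_ms(t, seq)))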

Thanks for the discussion @Nikronic. Besides thread switching, memory I/O may also hurt performance for small tensors, because the I/O time takes up a relatively larger share of the total processing time.

I also ran a test that divides the input [1, 32, 100000] into 8 parts, each of shape [1, 32, 12500], and runs the conv on each part in one of 8 threads. Compared with running in a single thread, this gives roughly a 4x speedup. The only extra work is concatenating each thread's result and handling the boundaries. It seems PyTorch (and the mkl-dnn backend) does not handle a small conv over a big input in this way.
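
For reference, here is a minimal sketch of that chunked approach (not the exact code I used; the ThreadPoolExecutor-based splitting below is just an illustration, and it assumes the sequence length divides evenly by the number of chunks):

import torch
import torch.nn as nn
from concurrent.futures import ThreadPoolExecutor

def chunked_conv1d(conv, x, num_chunks=8):
    k = conv.kernel_size[0]
    seq = x.shape[-1]
    step = seq // num_chunks
    # overlap adjacent chunks by k - 1 samples so the concatenated output
    # matches a single conv call (stride=1, no padding)
    chunks = [x[..., i * step : min(seq, (i + 1) * step + k - 1)]
              for i in range(num_chunks)]

    def run(chunk):
        with torch.no_grad():  # grad mode is thread-local, so disable it per worker
            return conv(chunk)

    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        outs = list(pool.map(run, chunks))
    return torch.cat(outs, dim=-1)

torch.set_num_threads(1)  # one intra-op thread per worker to avoid oversubscription
conv = nn.Conv1d(32, 32, kernel_size=3, stride=1).eval()
x = torch.rand(1, 32, 100000)
with torch.no_grad():
    ref = conv(x)
assert torch.allclose(chunked_conv1d(conv, x), ref, atol=1e-5)

As far as I know the heavy ATen work releases the GIL, so the Python worker threads really do run the conv chunks in parallel.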