Use MKLDNN in pytorch

Hello,
Does PyTorch support MKLDNN (or DNNL) by default for inference on CPU? If not, can anyone help me with how to use it?
I came across the to_mkldnn() method in PyTorch v1.2.0: https://pytorch.org/docs/master/tensors.html?highlight=mkldnn#torch.Tensor.to_mkldnn
What is the purpose of this method? Should it be used to convert tensors to the ‘mkldnn’ type during the forward pass?

Thanks in advance!

import torch
print(*torch.__config__.show().split("\n"), sep="\n")

This should tell you whether mkldnn is supported by your binaries or not. I guess the purpose is to enable native support for the mkldnn backend.

The test script (test_mkldnn.py) may also help you see what is currently supported.
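
If you would rather check the flag programmatically than read the whole dump, here is a minimal sketch. `build_flag` and `mkldnn_enabled` are hypothetical helper names (not part of PyTorch); they only parse the text returned by torch.__config__.show() and assume the `NAME=value` layout seen in the build settings quoted in this thread. The optional live check at the bottom uses torch.backends.mkldnn.is_available().

```python
import re


def build_flag(config_text, name):
    """Return the value of a `NAME=value` build setting, or None if absent."""
    # Requiring "=" right after the name keeps USE_MKL from matching USE_MKLDNN.
    match = re.search(r"\b" + re.escape(name) + r"=([^,\s]+)", config_text)
    return match.group(1) if match else None


def mkldnn_enabled(config_text):
    """True if the build settings report USE_MKLDNN as switched on."""
    value = build_flag(config_text, "USE_MKLDNN")
    return value is not None and value.upper() in ("ON", "1", "TRUE")


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        print("USE_MKLDNN:", build_flag(torch.__config__.show(), "USE_MKLDNN"))
        print("mkldnn available:", torch.backends.mkldnn.is_available())
```

The string parsing is only as reliable as the layout of the config dump, so treat torch.backends.mkldnn.is_available() as the more direct answer when it is present in your version.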


Thanks for the reply!

The following is the output of print(*torch.__config__.show().split("\n"), sep="\n"):

PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.18.1 (Git Hash 7de7e5d02bf687f971e7668963649728356e0c20)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

When I run test_mkldnn.py, here is what I get (errors for the resnet18 and resnext50_32x4d models):

======================================================================
ERROR: test_resnet18 (__main__.TestMkldnn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_mkldnn.py", line 423, in test_resnet18
    self._test_imagenet_model(model)
  File "test_mkldnn.py", line 417, in _test_imagenet_model
    mkldnn_model(x.to_mkldnn()).to_dense(),
  File "/opt/dev/envs/deep_learning/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/dev/envs/deep_learning/lib/python3.5/site-packages/torchvision/models/resnet.py", line 161, in forward
    x = x.view(x.size(0), -1)
RuntimeError: Currently Mkldnn tensor does not support view. Change to use reshape instead

======================================================================
ERROR: test_resnext50_32x4d (__main__.TestMkldnn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_mkldnn.py", line 427, in test_resnext50_32x4d
    model = torchvision.models.resnet.resnext50_32x4d(pretrained=False)
AttributeError: module 'torchvision.models.resnet' has no attribute 'resnext50_32x4d'

----------------------------------------------------------------------
Ran 29 tests in 2.538s

FAILED (errors=2)

I can see that PyTorch is built with MKLDNN, so does that mean inference on Intel CPUs will be optimized by default?


I don’t think mkldnn is enabled by default. At least, for my build it isn’t:

  • Testing default CPU tensors:
python -m timeit --setup="import torch; net = torch.nn.Linear(1000, 2); batch = torch.rand(16, 1000)" "net(batch)"
  • Testing explicit MKLDNN backend:
python -m timeit --setup="import torch; from torch.utils import mkldnn as mkldnn_utils; net = torch.nn.Linear(1000, 2); net = mkldnn_utils.to_mkldnn(net); batch = torch.rand(16, 1000); batch = batch.to_mkldnn()" "net(batch)"

I get a 1.5x speedup with MKLDNN for these parameters.

I guess you are not passing all the tests because you didn’t build PyTorch from source.


I have two questions here:

  1. My build shows it is built with MKLDNN, so there is no need for me to build PyTorch again from source?
  2. Even if PyTorch is built with MKLDNN, we still have to explicitly call the “to_mkldnn()” method to make use of MKLDNN operations, right?

You don’t need to build PyTorch from source. However, the binaries you download via conda or PyPI sometimes lack a desired feature; in that case the only way to get it working is to build everything from source and make sure the build options match your needs.

Yes, you still have to call to_mkldnn() explicitly.
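
To make the explicit opt-in concrete, here is a rough sketch of the flow: convert the model with mkldnn_utils.to_mkldnn, convert the input with to_mkldnn(), and convert the result back with to_dense(). `run_with_optional_mkldnn` is a hypothetical helper name, and falling back to the default CPU path when torch.backends.mkldnn.is_available() returns False is an assumption about what you would want, not something PyTorch does on its own.

```python
import copy


def run_with_optional_mkldnn(model, batch):
    """Run one forward pass, using MKLDNN only when the binary supports it.

    Returns a tuple (dense_output, used_mkldnn).
    """
    import torch  # imported here so the helper can be defined without torch
    from torch.utils import mkldnn as mkldnn_utils

    model.eval()
    with torch.no_grad():
        if not torch.backends.mkldnn.is_available():
            # Binary built without MKLDNN: stay on the default CPU path.
            return model(batch), False
        # Precaution: convert a copy so the caller's model is left untouched.
        mkldnn_model = mkldnn_utils.to_mkldnn(copy.deepcopy(model))
        out = mkldnn_model(batch.to_mkldnn())
        # Convert back to an ordinary (dense) tensor before returning.
        return out.to_dense(), True


# Example with the layer sizes used in the timeit commands above:
#   net = torch.nn.Linear(1000, 2)
#   out, used = run_with_optional_mkldnn(net, torch.rand(16, 1000))
```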

Thanks for the reply.
Two things here:

  1. First, I don’t get the speedup you get with MKLDNN. In fact, it takes more time with MKLDNN.
    For this command:
python -m timeit --setup="import torch; from torch.utils import mkldnn as mkldnn_utils; net = torch.nn.Linear(1000, 2); net = mkldnn_utils.to_mkldnn(net); batch = torch.rand(16, 1000); batch = batch.to_mkldnn()" "net(batch)"
  2. Second, I get the following error whenever I try to use mkldnn.to_mkldnn(pretrained_model) on a model such as resnet18 or resnet50:
File "/opt/dev/envs/deep_learning/lib/python3.5/site-packages/torchvision/models/resnet.py", line 161, in forward
    x = x.view(x.size(0), -1)
RuntimeError: Currently Mkldnn tensor does not support view. Change to use reshape instead

I guess I should build PyTorch from source in order to get the speedup you describe.

This is weird. I would run some more tests with larger and smaller tensors. If MKLDNN doesn’t provide a speedup, your build is probably broken.

For these architectures, I guess you could easily adapt them to support MKLDNN: just replace the view calls with reshape.
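
As a sketch of that change, the functions below mirror the forward pass of a torchvision ResNet but flatten with reshape instead of view. `flatten_features` and `resnet_forward_with_reshape` are hypothetical names, and the attribute names (conv1, bn1, relu, maxpool, layer1 to layer4, avgpool, fc) are assumed to match the torchvision version in the traceback; check them against your installed resnet.py before relying on this.

```python
def flatten_features(x):
    """Flatten every dimension after the batch dimension.

    Uses reshape rather than view, because the error message says
    MKLDNN tensors do not support view.
    """
    return x.reshape(x.size(0), -1)


def resnet_forward_with_reshape(model, x):
    """Run a torchvision-style ResNet step by step, swapping view for reshape."""
    x = model.conv1(x)
    x = model.bn1(x)
    x = model.relu(x)
    x = model.maxpool(x)
    for layer in (model.layer1, model.layer2, model.layer3, model.layer4):
        x = layer(x)
    x = model.avgpool(x)
    x = flatten_features(x)  # original line: x = x.view(x.size(0), -1)
    return model.fc(x)


# Hypothetical usage, assuming an MKLDNN-enabled build:
#   net = mkldnn_utils.to_mkldnn(models.resnet18(pretrained=False).eval())
#   out = resnet_forward_with_reshape(net, batch.to_mkldnn()).to_dense()
```

Editing the one line in your local copy of resnet.py would have the same effect; the wrapper just avoids touching the installed package.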


Hi, thanks a lot for the great explanation.
I have a question regarding this.
I tried your snippets to see how much of a difference MKLDNN makes on my system, but I ran into this error:
**RuntimeError**: MKL-DNN build is disabled

My PyTorch setup is as follows:

Python: 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)](6, 0, 1)
Pytorch : 1.3.1+cpu
Windows: x64 ver 1803 build 17134.1184
CPU : Intel® Core™ i7-6770HQ CPU @ 2.60GHz

Any help is greatly appreciated

Hi!

How did you install PyTorch? Using conda? I have torch 1.3.0 installed with Python 3.5.4 (Anaconda custom), and MKL-DNN is running. However, I’m not getting the speedup I stated above on this setup; in fact, MKL-DNN is 10% slower than plain PyTorch. I haven’t followed all the updates to the backend, but maybe the linear kernel torch now uses is better than MKL-DNN’s.


I simply used the pip package and installed the CPU-only version, as I don’t have a supported GPU on my system.
By MKLDNN being 10% slower, do you mean slower than the GPU implementation or the normal CPU one?

I meant the CPU.

Why don’t you use conda? For instance, when I run conda list | grep mkl I get:

mkl                       2019.3                      199  
mkl-include               2019.4                      243

These packages are used when you compile torch from source in order to enable MKL-DNN. I’m not sure, but maybe you need them installed to get a binary with MKL-DNN enabled.

I’m on a Windows machine (Win10) and I tried installing with conda:
conda install pytorch torchvision cpuonly -c pytorch
However, the result is the very same as the pip installation.
This is what I get from torch.__config__.show():

PyTorch built with:
  - MSVC 191125547
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - OpenMP 200203
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, 
                    CXX_FLAGS=/DWIN32 /D_WINDOWS  /GR  /w /EHa /MP /bigobj -openmp,
                    DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, USE_CUDA=False,
                    USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, 
                    USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, 
                    USE_NNPACK=OFF, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

Are you on a CUDA-capable machine? That is, did you install the CUDA version of PyTorch?
Maybe this is only enabled in the CUDA build?
Or maybe it’s only available in the Linux version of PyTorch and not on Windows?

Nope!

Here’s the flag: USE_MKLDNN=OFF. This binary wasn’t compiled with MKL-DNN support.

Maybe! But you can always try to build from source!


I checked. The Linux version comes with MKLDNN enabled:

PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.3 Product Build 20190125 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=0, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

But the weird thing is, it’s much slower than the normal scenario!

Update:
Wow! Just wow! MKLDNN does make a HUGE difference! It’s over 400x faster by the timings below!
I just wrote a simple benchmark with standard models such as resnet18, and the difference is night and day!

MKL time: 0.61
MKLDNN time: 0.0014
Here is the snippet:

#%%
import torch
print(*torch.__config__.show().split("\n"), sep="\n")
#%%
import time
class Timer(object):
    """A simple timer."""
    def __init__(self):
        self.total_time = 0.
        self.calls = 0
        self.start_time = 0.
        self.diff = 0.
        self.average_time = 0.

    def tic(self):
        # using time.time instead of time.clock because time.clock
        # does not normalize for multithreading
        self.start_time = time.time()

    def toc(self, average=True):
        self.diff = time.time() - self.start_time
        self.total_time += self.diff
        self.calls += 1
        self.average_time = self.total_time / self.calls
        if average:
            return self.average_time
        else:
            return self.diff

    def clear(self):
        self.total_time = 0.
        self.calls = 0
        self.start_time = 0.
        self.diff = 0.
        self.average_time = 0.

_t = {'mkl': Timer(),
      'mkldnn': Timer()}
#%%

import torch
from torchvision import models
net = models.resnet18(False)
net.eval()
batch = torch.rand(10, 3,224,224)

_t['mkl'].tic()
for i in range(1):
    net(batch)
_t['mkl'].toc()

from torch.utils import mkldnn as mkldnn_utils
net = models.resnet18(False)
net.eval()
net = mkldnn_utils.to_mkldnn(net)
batch = torch.rand(10, 3,224,224)
batch = batch.to_mkldnn()

_t['mkldnn'].tic()
for i in range(1):
    net(batch)
_t['mkldnn'].toc()

print(f"MKL time: {_t['mkl'].average_time}s")
print(f"MKLDNN time: {_t['mkldnn'].average_time}s")

The catch here is that the actual net (its forward pass) must be benchmarked; it also seems to be a repetitive action, so the CPU actually switches to it!
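
Note that the snippet above times a single forward pass with no warm-up, so first-call overhead may dominate either number. A minimal sketch of a steadier measurement, assuming a hypothetical `benchmark` helper (not from the original post) that discards a few warm-up calls and reports the median of several timed ones:

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10):
    """Return the median wall-clock seconds of fn() over `iters` timed calls."""
    for _ in range(warmup):
        fn()  # untimed: lets caches, thread pools and lazy init settle
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical usage with the two models from the snippet above
# (dense_net/dense_batch and mkldnn_net/mkldnn_batch are placeholder names):
#   with torch.no_grad():
#       mkl_s = benchmark(lambda: dense_net(dense_batch))
#       mkldnn_s = benchmark(lambda: mkldnn_net(mkldnn_batch))
#   print(f"MKL time: {mkl_s:.4f}s, MKLDNN time: {mkldnn_s:.4f}s")
```

The median is used rather than the mean so that one slow outlier call does not skew the result; the warm-up and iteration counts are arbitrary defaults.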

Could you please tell me how you ran the code you show here? When I run it, I am unfortunately unable to replicate your speeds.

My build has USE_MKLDNN=ON and I am running on Intel® Xeon® CPU E5-2640 v3 @ 2.60GHz

My PyTorch config details are:

PyTorch built with:

  • GCC 7.3
  • Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
  • Intel® MKL-DNN v0.18.1 (Git Hash 7de7e5d02bf687f971e7668963649728356e0c20)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CUDA Runtime 10.0
  • NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
  • CuDNN 7.6.2
  • Magma 2.5.1
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

When I run your benchmark code, mkldnn does give me a slightly faster runtime, but nowhere near as fast as you reported.

The speeds I got are:

MKL time: 0.266s
MKLDNN time: 0.214s

Multiple runs give results in this ballpark. Am I doing something wrong? I’d really appreciate any help I can get!

Hi, we are having a discussion concerning this here.
If you wait a bit longer, you’ll see a significant boost once we switch to DNNL. However, the transition is not yet complete.

Thanks a lot.

In the meantime, can I expect better results using the torch.quantization module? I noticed it uses FBGEMM.

Also, is there any particular reason for supporting multiple quantization libraries (FBGEMM, MKLDNN) or is it just to provide users with their preference of library?

Yes, quantization has far better performance than plain, deprecated MKLDNN.
MKLDNN is not a quantization library. If I’m not mistaken, you can’t use MKLDNN together with quantized operators.
Either switch to quantization or stay with MKLDNN. You may also have a look at TVM, which supports Intel graphics cards and CPUs as well. Unfortunately, the PyTorch team does not seem to be active on that project, while others such as TF and especially MXNet are investing in it heavily. (PyTorch had a pytorch-tvm branch, which unfortunately is not maintained and is not up to date.)
Your other option is OpenVINO. While other frameworks are supported pretty well, PyTorch is missing and is only usable through ONNX conversion (convert your PyTorch models to ONNX).
The problem with that is that not all operators are supported in ONNX (and OpenVINO doesn’t support all of ONNX either), and the conversion is not ideal and is error-prone.
So your best bet is to wait for the Intel-PyTorch team to transition to DNNL, which has support for Intel GPUs (meaning you’ll get GPU acceleration). Currently one PR is waiting to be merged and would be the foundation; after that, support for OpenCL would hopefully follow, and with it support for GPUs other than NVIDIA or AMD ROCm.
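
On the quantization option mentioned at the top of this reply, here is a minimal sketch using dynamic quantization of the Linear layers via torch.quantization.quantize_dynamic. `quantize_linear_layers` is a hypothetical helper name, and restricting quantization to nn.Linear with dtype torch.qint8 is an assumption made for illustration; whether it actually beats MKLDNN depends on your model and CPU, so measure before committing to it.

```python
def quantize_linear_layers(model):
    """Return a copy of `model` whose nn.Linear layers use int8 dynamic quantization."""
    import torch  # imported here so the helper can be defined without torch

    # quantize_dynamic leaves the original model untouched by default
    # (inplace=False) and returns a converted copy.
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )


# Hypothetical usage:
#   net = torch.nn.Sequential(torch.nn.Linear(1000, 2)).eval()
#   qnet = quantize_linear_layers(net)
#   out = qnet(torch.rand(16, 1000))
```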

Thanks a lot! This was helpful