Does building from source really speed up inference?

Hi,

I tried building PyTorch from source on a GTX 1070 Ti with:

python setup.py build develop

but found no inference speedup. The results seem to be the opposite of this post (https://medium.com/repro-repo/build-pytorch-from-source-on-ubuntu-18-04-1c5556ca8fbf).
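(For anyone who wants to reproduce the measurement, a minimal timing sketch is below. The tiny model is only a stand-in, and the torch.cuda.synchronize() calls are there because CUDA runs asynchronously.)

import time
import torch

# Stand-in model and input; substitute your real network and batch size.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3),
    torch.nn.ReLU(),
).cuda().eval()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):            # warmup: lets cuDNN pick its algorithms
        model(x)
    torch.cuda.synchronize()       # drain queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()       # drain again before stopping the clock
    print("avg forward:", (time.perf_counter() - start) / 100, "s")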

Any ideas?

Hi,

The binaries come with a set of bundled libraries (MKL, MAGMA, etc.) that are very important for speed.
If you compile from source, you will want to make sure the ones relevant to your workload are installed locally so that they can be used during compilation.
Also, in some cases, builds of OpenBLAS, MAGMA, etc. tailored to your machine will be slightly faster.
But if you don’t have any BLAS library (or use the system default one), you might see lower performance than the binaries, which ship with an off-the-shelf optimized BLAS.
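One quick way to see what each install was compiled against is to print the build config in both environments and diff the output; all of these are standard torch APIs:

import torch

print(torch.__config__.show())               # full build summary: BLAS, MKL-DNN, CUDA, flags
print(torch.backends.mkl.is_available())     # was MKL compiled in?
print(torch.backends.mkldnn.is_available())  # was MKL-DNN (oneDNN) compiled in?
print(torch.backends.cudnn.version())        # cuDNN version (None if not built with it)

Run it once in the binary install and once in your source build; differences in the BLAS and MKL-DNN lines are usually the first thing to look at.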

PyTorch has many third-party libraries, but I am not sure which ones are key for performance.

Do you have any suggestions on how to enable these libraries when compiling?

Here is my CMake output. Which libraries do I have to enable in order to speed up inference?

-- ******** Summary ********
-- General:
-- CMake version : 3.15.4
-- CMake command : /home/acer/cmake-3.15.4-Linux-x86_64/bin/cmake
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- C++ compiler id : GNU
-- C++ compiler version : 7.5.0
-- BLAS : MKL
-- CXX flags : -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow
-- Build type : Release
-- Compile definitions : ONNX_ML=1;ONNXIFI_ENABLE_EXT=1;ONNX_NAMESPACE=onnx_torch;HAVE_MMAP=1;_FILE_OFFSET_BITS=64;HAVE_SHM_OPEN=1;HAVE_SHM_UNLINK=1;HAVE_MALLOC_USABLE_SIZE=1;USE_EXTERNAL_MZCRC;MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS
-- CMAKE_PREFIX_PATH : /home/acer/.pyenv/versions/pytorch_build/lib/python3.7/site-packages;/usr/local/cuda
-- CMAKE_INSTALL_PREFIX : /home/acer/nfs-share/pytorch/torch

-- TORCH_VERSION : 1.7.0
-- CAFFE2_VERSION : 1.7.0
-- BUILD_CAFFE2_MOBILE : OFF
-- USE_STATIC_DISPATCH : OFF
-- BUILD_BINARY : OFF
-- BUILD_CUSTOM_PROTOBUF : ON
-- Link local protobuf : ON
-- BUILD_DOCS : OFF
-- BUILD_PYTHON : True
-- Python version : 3.7.4
-- Python executable : /home/acer/.pyenv/versions/pytorch_build/bin/python
-- Pythonlibs version : 3.7.4
-- Python library : /home/acer/.pyenv/versions/3.7.4/lib/libpython3.7m.so.1.0
-- Python includes : /home/acer/.pyenv/versions/3.7.4/include/python3.7m
-- Python site-packages: lib/python3.7/site-packages
-- BUILD_CAFFE2_OPS : ON
-- BUILD_SHARED_LIBS : ON
-- BUILD_TEST : True
-- BUILD_JNI : OFF
-- INTERN_BUILD_MOBILE :
-- CLANG_CODE_COVERAGE : OFF
-- USE_ASAN : OFF
-- USE_CUDA : ON
-- CUDA static link : OFF
-- USE_CUDNN : ON
-- CUDA version : 10.2
-- cuDNN version : 7.6.5
-- CUDA root directory : /usr/local/cuda
-- CUDA library : /usr/local/cuda/lib64/stubs/libcuda.so
-- cudart library : /usr/local/cuda/lib64/libcudart.so
-- cublas library : /usr/lib/x86_64-linux-gnu/libcublas.so
-- cufft library : /usr/local/cuda/lib64/libcufft.so
-- curand library : /usr/local/cuda/lib64/libcurand.so
-- cuDNN library : /usr/lib/x86_64-linux-gnu/libcudnn.so
-- nvrtc : /usr/local/cuda/lib64/libnvrtc.so
-- CUDA include path : /usr/local/cuda/include
-- NVCC executable : /usr/local/cuda/bin/nvcc
-- NVCC flags : -DONNX_NAMESPACE=onnx_torch;-gencode;arch=compute_61,code=sm_61;-Xcudafe;--diag_suppress=cc_clobber_ignored;-Xcudafe;--diag_suppress=integer_sign_change;-Xcudafe;--diag_suppress=useless_using_declaration;-Xcudafe;--diag_suppress=set_but_not_used;-Xcudafe;--diag_suppress=field_without_dll_interface;-Xcudafe;--diag_suppress=base_class_has_different_dll_interface;-Xcudafe;--diag_suppress=dll_interface_conflict_none_assumed;-Xcudafe;--diag_suppress=dll_interface_conflict_dllexport_assumed;-Xcudafe;--diag_suppress=implicit_return_from_non_void_function;-Xcudafe;--diag_suppress=unsigned_compare_with_zero;-Xcudafe;--diag_suppress=declared_but_not_referenced;-Xcudafe;--diag_suppress=bad_friend_decl;-std=c++14;-Xcompiler;-fPIC;--expt-relaxed-constexpr;--expt-extended-lambda;-Wno-deprecated-gpu-targets;--expt-extended-lambda;-gencode;arch=compute_61,code=sm_61;-Xcompiler;-fPIC;-DCUDA_HAS_FP16=1;-D__CUDA_NO_HALF_OPERATORS__;-D__CUDA_NO_HALF_CONVERSIONS__;-D__CUDA_NO_HALF2_OPERATORS__
-- CUDA host compiler : /usr/bin/cc
-- NVCC --device-c : OFF
-- USE_TENSORRT : OFF
-- USE_ROCM : OFF
-- USE_EIGEN_FOR_BLAS : ON
-- USE_FBGEMM : ON
-- USE_FAKELOWP : OFF
-- USE_FFMPEG : OFF
-- USE_GFLAGS : OFF
-- USE_GLOG : OFF
-- USE_LEVELDB : OFF
-- USE_LITE_PROTO : OFF
-- USE_LMDB : OFF
-- USE_METAL : OFF
-- USE_MKL : OFF
-- USE_MKLDNN : ON
-- USE_MKLDNN_CBLAS : OFF
-- USE_NCCL : ON
-- USE_SYSTEM_NCCL : OFF
-- USE_NNPACK : ON
-- USE_NUMPY : ON
-- USE_OBSERVERS : ON
-- USE_OPENCL : OFF
-- USE_OPENCV : OFF
-- USE_OPENMP : ON
-- USE_TBB : OFF
-- USE_VULKAN : OFF
-- USE_PROF : OFF
-- USE_QNNPACK : ON
-- USE_PYTORCH_QNNPACK : ON
-- USE_REDIS : OFF
-- USE_ROCKSDB : OFF
-- USE_ZMQ : OFF
-- USE_DISTRIBUTED : ON
-- USE_MPI : ON
-- USE_GLOO : ON
-- USE_TENSORPIPE : ON

It all depends on what your workload is :slight_smile:
Do you use CPU or GPU? Do you use distributed training? Do you use mixed precision? Do you use linear algebra functions?

I use an NVIDIA GPU and distributed training, plus mixed precision, and of course linear algebra functions.

In that case, you want to make sure you have the latest CUDA/cuDNN installed.
For distributed training, the Gloo backend should already be provided properly.
For mixed precision, I think the basic CUDA libraries handle that well.
For linear algebra, you want a good BLAS/LAPACK library on the CPU, like MKL or OpenBLAS; on the GPU you will need MAGMA.
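Most of these switches are environment variables read by setup.py (the names below are the ones I believe the build scripts use; double-check them in the comments at the top of setup.py), so a rebuild that picks up MKL and MKL-DNN would look something like:

BLAS=MKL USE_MKLDNN=1 USE_CUDA=1 USE_CUDNN=1 python setup.py build develop

Note that your CMake summary above shows USE_MKL : OFF and USE_EIGEN_FOR_BLAS : ON, so your current build uses Eigen as its BLAS, which likely explains part of the gap. For MAGMA, installing it before building should be enough for CMake to pick it up (the pytorch conda channel ships builds such as magma-cuda102 matching your CUDA 10.2), and you can check torch.cuda.has_magma afterwards to confirm.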

You can find some instructions here: https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md