The binary comes with a set of bundled libraries (MKL, MAGMA, etc.) that are very important for speed.
If you compile from source, you will want to make sure the ones relevant to your workload are installed locally so that they can be picked up during compilation.
Also, in some cases a build of OpenBLAS, MAGMA, etc. tailored to your machine will be slightly faster.
But if you don't have any BLAS library (or use the system default one), you might see lower performance compared to the binary, which ships with an off-the-shelf optimized BLAS.
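If you want to double-check what a given install was actually built against, something like this works from Python (just a quick sketch; the exact output of `show()` varies between versions):

```python
import torch

# Prints the compile-time configuration of the installed build,
# including which BLAS / MKL-DNN / CUDA options were enabled.
print(torch.__config__.show())

# Backend-specific checks:
print("MKL available:    ", torch.backends.mkl.is_available())
print("MKL-DNN available:", torch.backends.mkldnn.is_available())
```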
-- TORCH_VERSION : 1.7.0
-- CAFFE2_VERSION : 1.7.0
-- BUILD_CAFFE2_MOBILE : OFF
-- USE_STATIC_DISPATCH : OFF
-- BUILD_BINARY : OFF
-- BUILD_CUSTOM_PROTOBUF : ON
-- Link local protobuf : ON
-- BUILD_DOCS : OFF
-- BUILD_PYTHON : True
-- Python version : 3.7.4
-- Python executable : /home/acer/.pyenv/versions/pytorch_build/bin/python
-- Pythonlibs version : 3.7.4
-- Python library : /home/acer/.pyenv/versions/3.7.4/lib/libpython3.7m.so.1.0
-- Python includes : /home/acer/.pyenv/versions/3.7.4/include/python3.7m
-- Python site-packages: lib/python3.7/site-packages
-- BUILD_CAFFE2_OPS : ON
-- BUILD_SHARED_LIBS : ON
-- BUILD_TEST : True
-- BUILD_JNI : OFF
-- INTERN_BUILD_MOBILE :
-- CLANG_CODE_COVERAGE : OFF
-- USE_ASAN : OFF
-- USE_CUDA : ON
-- CUDA static link : OFF
-- USE_CUDNN : ON
-- CUDA version : 10.2
-- cuDNN version : 7.6.5
-- CUDA root directory : /usr/local/cuda
-- CUDA library : /usr/local/cuda/lib64/stubs/libcuda.so
-- cudart library : /usr/local/cuda/lib64/libcudart.so
-- cublas library : /usr/lib/x86_64-linux-gnu/libcublas.so
-- cufft library : /usr/local/cuda/lib64/libcufft.so
-- curand library : /usr/local/cuda/lib64/libcurand.so
-- cuDNN library : /usr/lib/x86_64-linux-gnu/libcudnn.so
-- nvrtc : /usr/local/cuda/lib64/libnvrtc.so
-- CUDA include path : /usr/local/cuda/include
-- NVCC executable : /usr/local/cuda/bin/nvcc
-- NVCC flags : -DONNX_NAMESPACE=onnx_torch;-gencode;arch=compute_61,code=sm_61;-Xcudafe;--diag_suppress=cc_clobber_ignored;-Xcudafe;--diag_suppress=integer_sign_change;-Xcudafe;--diag_suppress=useless_using_declaration;-Xcudafe;--diag_suppress=set_but_not_used;-Xcudafe;--diag_suppress=field_without_dll_interface;-Xcudafe;--diag_suppress=base_class_has_different_dll_interface;-Xcudafe;--diag_suppress=dll_interface_conflict_none_assumed;-Xcudafe;--diag_suppress=dll_interface_conflict_dllexport_assumed;-Xcudafe;--diag_suppress=implicit_return_from_non_void_function;-Xcudafe;--diag_suppress=unsigned_compare_with_zero;-Xcudafe;--diag_suppress=declared_but_not_referenced;-Xcudafe;--diag_suppress=bad_friend_decl;-std=c++14;-Xcompiler;-fPIC;--expt-relaxed-constexpr;--expt-extended-lambda;-Wno-deprecated-gpu-targets;--expt-extended-lambda;-gencode;arch=compute_61,code=sm_61;-Xcompiler;-fPIC;-DCUDA_HAS_FP16=1;-D__CUDA_NO_HALF_OPERATORS__;-D__CUDA_NO_HALF_CONVERSIONS__;-D__CUDA_NO_HALF2_OPERATORS__
-- CUDA host compiler : /usr/bin/cc
-- NVCC --device-c : OFF
-- USE_TENSORRT : OFF
-- USE_ROCM : OFF
-- USE_EIGEN_FOR_BLAS : ON
-- USE_FBGEMM : ON
-- USE_FAKELOWP : OFF
-- USE_FFMPEG : OFF
-- USE_GFLAGS : OFF
-- USE_GLOG : OFF
-- USE_LEVELDB : OFF
-- USE_LITE_PROTO : OFF
-- USE_LMDB : OFF
-- USE_METAL : OFF
-- USE_MKL : OFF
-- USE_MKLDNN : ON
-- USE_MKLDNN_CBLAS : OFF
-- USE_NCCL : ON
-- USE_SYSTEM_NCCL : OFF
-- USE_NNPACK : ON
-- USE_NUMPY : ON
-- USE_OBSERVERS : ON
-- USE_OPENCL : OFF
-- USE_OPENCV : OFF
-- USE_OPENMP : ON
-- USE_TBB : OFF
-- USE_VULKAN : OFF
-- USE_PROF : OFF
-- USE_QNNPACK : ON
-- USE_PYTORCH_QNNPACK : ON
-- USE_REDIS : OFF
-- USE_ROCKSDB : OFF
-- USE_ZMQ : OFF
-- USE_DISTRIBUTED : ON
-- USE_MPI : ON
-- USE_GLOO : ON
-- USE_TENSORPIPE : ON
In that case, you want to make sure you have the latest CUDA/cuDNN installed.
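To confirm the runtime actually sees them, you can check from Python (the example values are just the ones from the log above):

```python
import torch

# CUDA / cuDNN versions the build was compiled against,
# and whether a GPU is actually visible at runtime.
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:  ", torch.version.cuda)              # e.g. '10.2'
print("cuDNN enabled: ", torch.backends.cudnn.enabled)
print("cuDNN version: ", torch.backends.cudnn.version())  # e.g. 7605 for 7.6.5
```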
For distributed training, the Gloo backend should already be provided out of the box.
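A quick way to sanity-check that is to spin up a single-process Gloo group (minimal sketch; the address and port are just placeholders):

```python
import os
import torch.distributed as dist

# torch.distributed has to be compiled in (USE_DISTRIBUTED=ON above).
print("distributed available:", dist.is_available())

# Single-process "group of one" using the gloo backend, just to prove it loads.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("gloo initialised:", dist.is_initialized())
dist.destroy_process_group()
```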
For mixed precision, I think the basic CUDA libraries handle that well.
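A CUDA build of 1.7 should also give you native AMP (torch.cuda.amp), so a minimal mixed-precision training step looks roughly like this:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 1024, device="cuda")

# Selected ops run in fp16 under autocast; the scaler handles
# loss scaling for the backward pass and optimizer step.
with torch.cuda.amp.autocast():
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```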
For linear algebra, you want a good BLAS/LAPACK library on CPU, such as MKL or OpenBLAS, and for GPU you will need MAGMA.
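You can check both from Python; `torch.cuda.has_magma` is my assumption of the right attribute here, but the inverse calls below will fail loudly anyway if LAPACK/MAGMA are missing:

```python
import torch

# Whether the CUDA build found MAGMA at compile time
# (attribute name assumed; present in recent releases).
print("MAGMA linked:", torch.cuda.has_magma)

# Smoke test: inverse goes through LAPACK on CPU and MAGMA on GPU,
# so it errors out if the corresponding library is missing.
a = torch.randn(512, 512)
print(torch.inverse(a).shape)
if torch.cuda.is_available():
    print(torch.inverse(a.cuda()).shape)
```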