I have built PyTorch from source on Ubuntu 22.04 with CUDA 12.8. My system has an RTX A4500 GPU. I have observed several differences between the pre-built PyTorch and the source build. The inference time of a vision transformer model is much higher with the source-built PyTorch than with the pre-built one. Also, if I profile a particular kernel using Nsight Compute, the source-built PyTorch produces noisy measurements in almost all metrics, and the difference is prominent if I plot the metric data and compare it with the pre-built PyTorch. Any idea why this kind of situation occurs?
This is my build command for pytorch:
USE_CUDA=1 USE_MKLDNN=0 USE_QNNPACK=0 USE_XNNPACK=0 BUILD_TEST=0 CMAKE_BUILD_TYPE=Release MAX_JOBS=6 python3 setup.py develop
Also, cuDNN is not used, as far as I can see in the build configuration log. I have tried this in both a conda and a plain Python environment.
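To rule out measurement noise on the timing side, here is a minimal sketch of the kind of latency comparison that can be run identically under both installs (torchvision's vit_b_16 is only a stand-in for the actual ViT model):

import torch
import torchvision

# Placeholder model; substitute the actual ViT under investigation.
model = torchvision.models.vit_b_16().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.inference_mode():
    # Warm-up so CUDA context creation and algorithm selection don't skew the numbers.
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        model(x)
    end.record()
    torch.cuda.synchronize()

print(f"mean latency: {start.elapsed_time(end) / 100:.3f} ms")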
Make sure all math libraries (e.g. cuBLAS etc.) are matching and enabled. Also, what's your use case for the source build?
Thank you for your response. I’ll check whether those math libraries are enabled. Are these libraries also installed automatically during a standard installation of the pre-built PyTorch package?
I’m currently conducting a root cause analysis of the input-dependent behavior of a specific kernel used in ViT models on NVIDIA GPUs.
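Here is a minimal sketch of how the math-library side of the build can be inspected from Python in both installs; which settings are worth comparing (e.g. the TF32 flags) is an assumption:

import torch

# High-level build summary (compiler, MKL, cuDNN, CUDA runtime, NVCC arch flags, ...).
print(torch.__config__.show())

# Individual backends that back the math-heavy ops.
print("CUDA runtime   :", torch.version.cuda)
print("cuDNN version  :", torch.backends.cudnn.version())
print("cuDNN enabled  :", torch.backends.cudnn.enabled)
print("MKL available  :", torch.backends.mkl.is_available())
print("MKL-DNN avail. :", torch.backends.mkldnn.is_available())
print("TF32 (matmul)  :", torch.backends.cuda.matmul.allow_tf32)
print("TF32 (cudnn)   :", torch.backends.cudnn.allow_tf32)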
Yes, you can simply install any of our PyTorch binaries, and the install logs printed to the terminal will show which libs are installed in addition.
-- ******** Summary ********
-- General:
-- CMake version : 4.0.3
-- CMake command : /home/arunava/miniconda3/envs/ac/lib/python3.10/site-packages/cmake/data/bin/cmake
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- C++ compiler id : GNU
-- C++ compiler version : 11.4.0
-- Using ccache if found : ON
-- Found ccache : CCACHE_PROGRAM-NOTFOUND
-- CXX flags : -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow
-- Shared LD flags : -Wl,--no-as-needed -rdynamic
-- Static LD flags :
-- Module LD flags :
-- Build type : Release
-- Compile definitions : ONNX_ML=1;ONNXIFI_ENABLE_EXT=1;ONNX_NAMESPACE=onnx_torch;IDEEP_USE_MKL;HAVE_MMAP=1;_FILE_OFFSET_BITS=64;HAVE_SHM_OPEN=1;HAVE_SHM_UNLINK=1;HAVE_MALLOC_USABLE_SIZE=1;USE_EXTERNAL_MZCRC;MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS
-- CMAKE_PREFIX_PATH : /home/arunava/miniconda3/envs/ac/lib/python3.10/site-packages;/home/arunava/miniconda3/envs/ac:/home/arunava/miniconda3/envs/ac:;/usr/local/cuda-12.8;/usr/local/cuda-12.8;/usr/local/cuda-12.8
-- CMAKE_INSTALL_PREFIX : /home/arunava/gpu_experiment/RCA/pytorch/torch
-- USE_GOLD_LINKER : OFF
--
-- TORCH_VERSION : 2.7.1
-- BUILD_STATIC_RUNTIME_BENCHMARK: OFF
-- BUILD_BINARY : OFF
-- BUILD_CUSTOM_PROTOBUF : ON
-- Link local protobuf : ON
-- BUILD_PYTHON : True
-- Python version : 3.10.18
-- Python executable : /home/arunava/miniconda3/envs/ac/bin/python3
-- Python library :
-- Python includes : /home/arunava/miniconda3/envs/ac/include/python3.10
-- Python site-package : /home/arunava/miniconda3/envs/ac/lib/python3.10/site-packages
-- BUILD_SHARED_LIBS : ON
-- CAFFE2_USE_MSVC_STATIC_RUNTIME : OFF
-- BUILD_TEST : False
-- BUILD_JNI : OFF
-- BUILD_MOBILE_AUTOGRAD : OFF
-- BUILD_LITE_INTERPRETER: OFF
-- INTERN_BUILD_MOBILE :
-- TRACING_BASED : OFF
-- USE_BLAS : 1
-- BLAS : mkl
-- BLAS_HAS_SBGEMM :
-- USE_LAPACK : 1
-- LAPACK : mkl
-- USE_ASAN : OFF
-- USE_TSAN : OFF
-- USE_CPP_CODE_COVERAGE : OFF
-- USE_CUDA : 1
-- Split CUDA :
-- CUDA static link : OFF
-- USE_CUDNN : 1
-- USE_CUSPARSELT : ON
-- USE_CUDSS : ON
-- USE_CUFILE : ON
-- CUDA version : 12.8
-- USE_FLASH_ATTENTION : ON
-- USE_MEM_EFF_ATTENTION : ON
-- cuDNN version : 9.10.2
-- cuSPARSELt version : 0.7.1
-- cufile library : /usr/local/cuda-12.8/lib64/libcufile.so
-- CUDA root directory : /usr/local/cuda-12.8
-- CUDA library : /usr/lib/x86_64-linux-gnu/libcuda.so
-- cudart library : /usr/local/cuda-12.8/lib64/libcudart.so
-- cublas library : /usr/local/cuda-12.8/lib64/libcublas.so
-- cufft library : /usr/local/cuda-12.8/lib64/libcufft.so
-- curand library : /usr/local/cuda-12.8/lib64/libcurand.so
-- cusparse library : /usr/local/cuda-12.8/lib64/libcusparse.so
-- cuDNN library : /usr/lib/x86_64-linux-gnu/libcudnn.so
-- cuSPARSELt library : /usr/lib/x86_64-linux-gnu/libcusparseLt.so
-- cuDSS library : /usr/lib/x86_64-linux-gnu/libcudss.so
-- nvrtc : /usr/local/cuda-12.8/lib64/libnvrtc.so
-- CUDA include path : /usr/local/cuda-12.8/include
-- NVCC executable : /usr/local/cuda-12.8/bin/nvcc
-- CUDA compiler : /usr/local/cuda-12.8/bin/nvcc
-- CUDA flags : -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_89,code=sm_89 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__
-- CUDA host compiler :
-- CUDA --device-c : OFF
-- USE_TENSORRT :
-- USE_XPU : OFF
-- USE_ROCM : OFF
-- BUILD_NVFUSER :
-- USE_EIGEN_FOR_BLAS :
-- USE_FBGEMM : ON
-- USE_FAKELOWP : OFF
-- USE_KINETO : ON
-- USE_GFLAGS : OFF
-- USE_GLOG : OFF
-- USE_LITE_PROTO : OFF
-- USE_PYTORCH_METAL : OFF
-- USE_PYTORCH_METAL_EXPORT : OFF
-- USE_MPS : OFF
-- CAN_COMPILE_METAL :
-- USE_MKL : ON
-- USE_STATIC_MKL : OFF
-- USE_MKLDNN : 1
-- USE_MKLDNN_ACL : OFF
-- USE_MKLDNN_CBLAS : OFF
-- USE_UCC : OFF
-- USE_ITT : ON
-- USE_NCCL : ON
-- USE_SYSTEM_NCCL : OFF
-- USE_NNPACK : ON
-- USE_NUMPY : ON
-- USE_OBSERVERS : ON
-- USE_OPENCL : OFF
-- USE_OPENMP : ON
-- USE_MIMALLOC : OFF
-- USE_VULKAN : OFF
-- USE_PROF : OFF
-- USE_PYTORCH_QNNPACK : ON
-- USE_XNNPACK : ON
-- USE_DISTRIBUTED : ON
-- USE_MPI : OFF
-- USE_GLOO : ON
-- USE_GLOO_WITH_OPENSSL : OFF
-- USE_TENSORPIPE : ON
-- Public Dependencies : caffe2::mkl
-- Private Dependencies : Threads::Threads;pthreadpool;cpuinfo;pytorch_qnnpack;nnpack;XNNPACK;microkernels-prod;fbgemm;ittnotify;fp16;caffe2::openmp;tensorpipe;nlohmann;gloo;rt;fmt::fmt-header-only;kineto;gcc_s;gcc;dl
-- Public CUDA Deps. :
-- Private CUDA Deps. : caffe2::curand;caffe2::cufft;caffe2::cublas;torch::cudnn;torch::cusparselt;torch::cufile;__caffe2_nccl;tensorpipe_cuda;gloo_cuda;fmt::fmt-header-only;/usr/local/cuda-12.8/lib64/libcudart.so;CUDA::cusparse;CUDA::cufft;ATEN_CUDA_FILES_GEN_LIB
-- USE_COREML_DELEGATE : OFF
-- BUILD_LAZY_TS_BACKEND : ON
-- USE_ROCM_KERNEL_ASSERT : OFF
-- Performing Test HAS_WMISSING_PROTOTYPES
-- Performing Test HAS_WMISSING_PROTOTYPES - Failed
-- Performing Test HAS_WERROR_MISSING_PROTOTYPES
-- Performing Test HAS_WERROR_MISSING_PROTOTYPES - Failed
-- Configuring done (22.4s)
This is my CMake build configuration summary for my latest source-built PyTorch. One thing I noticed is that BUILD_BINARY is OFF. Is this setting okay for a source build?
And the following is the source-built torch config after installation.
PyTorch built with:
 - GCC 11.4
 - C++ Version: 201703
 - Intel(R) oneAPI Math Kernel Library Version 2025.0-Product Build 20241009 for Intel(R) 64 architecture applications
 - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)
 - OpenMP 201511 (a.k.a. OpenMP 4.5)
 - LAPACK is enabled (usually provided by MKL)
 - NNPACK is enabled
 - CPU capability usage: AVX512
 - CUDA Runtime 12.8
 - NVCC architecture flags: -gencode;arch=compute_89,code=sm_89
 - CuDNN 91.0.2 (built against CUDA 12.9)
 - Magma 2.7.1
 - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=e2d141dbde55c2a4370fac5165b0561b6af4798b, CUDA_VERSION=12.8, CUDNN_VERSION=9.10.2, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.1, USE_CUDA=1, USE_CUDNN=1, USE_CUSPARSELT=ON, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=1, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
Is there anything I am missing here? I have also matched the above configuration with the pre-built PyTorch v2.7.1, and there is no difference other than the versions. Currently, the only change I have made is that I built the PyTorch .whl package inside the conda environment, installed it outside the environment, and started profiling with Nsight Compute. I will update the result in a follow-up reply.
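As a side note, here is a minimal sketch, assuming the kernel of interest is reached during a single forward pass, of how the region can be marked with NVTX ranges so that Nsight Compute's NVTX filtering can narrow down the collection; the range name is arbitrary:

import torch

def profiled_forward(model, x):
    # Mark the forward pass so the profiler can restrict collection to this range.
    torch.cuda.nvtx.range_push("vit_forward")
    with torch.inference_mode():
        out = model(x)
    torch.cuda.synchronize()
    torch.cuda.nvtx.range_pop()
    return out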
Just to add here, I have checked that the kernels launched by ViT models with the source-built PyTorch are a little different compared to the pre-built PyTorch. With the pre-built PyTorch the following kernels are launched:
Available Kernels:
1. CatArrayBatchedCopy
2. CatArrayBatchedCopy_aligned16_contig
3. CatArrayBatchedCopy_contig
4. DeviceCompactInitKernel
5. DeviceReduceSingleTileKernel
6. DeviceSelectSweepKernel
7. ampere_sgemm_128x64_tn
8. ampere_sgemm_32x32_sliced1x4_tn
9. elementwise_kernel
10. fmha_cutlassF_f32_aligned_64x64_rf_sm80
11. implicit_convolve_sgemm
12. index_elementwise_kernel
13. reduce_kernel
14. unrolled_elementwise_kernel
15. vectorized_elementwise_kernel
16. vectorized_layer_norm_kernel
On the other hand, the following are the kernels launched by the source-built PyTorch:
Available Kernels:
1. CatArrayBatchedCopy
2. CatArrayBatchedCopy_alignedK_contig
3. CatArrayBatchedCopy_contig
4. DeviceCompactInitKernel
5. DeviceReduceSingleTileKernel
6. DeviceSelectSweepKernel
7. _5x_cudnn_ampere_scudnn_128x128_relu_interior_nn_v1
8. ampere_sgemm_128x64_tn
9. ampere_sgemm_32x32_sliced1x4_tn
10. computeOffsetsKernel
11. elementwise_kernel
12. fmha_cutlassF_f32_aligned_64x64_rf_sm80
13. index_elementwise_kernel
14. reduce_kernel
15. unrolled_elementwise_kernel
16. vectorized_elementwise_kernel
17. vectorized_layer_norm_kernel
But the model output as well as output.logits is the same in both cases. Is this kind of kernel mismatch the reason for the different metric values in my experiment?
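For reference, a minimal sketch of one way to enumerate the launched CUDA kernels programmatically with torch.profiler; the model and input shape are placeholders:

import torch
import torchvision
from torch.profiler import ProfilerActivity, profile

model = torchvision.models.vit_b_16().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.inference_mode(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# CUDA kernel names show up alongside the operator rows in this table,
# which makes it easy to diff the two builds.
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=30))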
I don’t understand this question, since the metric values should not change if the same model outputs are created. I don’t think checking the different kernels makes sense here; instead, you should investigate why the same model outputs create different metric values.
Thank you for your help. I found the issue—it was a rookie mistake in my metric visualization plot. Now, the results are consistent with the pre-built version.
That said, I have one doubt. If I modify any part of the PyTorch library, especially the kernel_forward function in the transformer module, do I need to retrain my model using the modified source-built PyTorch to observe changes in performance metrics? Or can I use the model that was trained with the original, unmodified version of the source-built PyTorch?
It depends on what exactly the changes do. If you change the module logic, e.g. add or remove operations, I would assume you would need to retrain your model, since the logic of the forward pass changed. However, if you only change a kernel implementation, no significant differences should be expected, as the model definition stays the same and only the kernel algorithm was changed.
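A minimal sketch of the kind of check that can confirm a kernel-only modification leaves the forward pass numerically equivalent (and hence that retraining is not needed); the model, checkpoint, and file names are placeholders:

import sys
import torch
import torchvision

# Placeholder model; in practice the same trained checkpoint must be loaded in both runs.
model = torchvision.models.vit_b_16().eval().cuda()
model.load_state_dict(torch.load("vit_checkpoint.pt"))

if sys.argv[1] == "save":  # run once with the unmodified build
    x = torch.randn(1, 3, 224, 224)
    with torch.inference_mode():
        out = model(x.cuda()).cpu()
    torch.save({"input": x, "output": out}, "reference.pt")
else:  # run again with the modified source build
    ref = torch.load("reference.pt")
    with torch.inference_mode():
        out = model(ref["input"].cuda()).cpu()
    # Bitwise equality is not expected when kernel implementations change; a
    # tolerance-based comparison is the meaningful check before deciding on retraining.
    print(torch.allclose(out, ref["output"], rtol=1e-4, atol=1e-5))
    print((out - ref["output"]).abs().max().item())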