I think this may be related to the recent change in the conda packages. I just created a new conda environment, installed PyTorch according to the official documentation (conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia) and tried to compile apex from source. The compile command line that pip runs looks like this:
/vc_data/users/heyangqin/anaconda3/envs/deepspeed/bin/nvcc -I/vc_data/users/heyangqin/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/include -I/vc_data/users/heyangqin/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/vc_data/users/heyangqin/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/include/TH -I/vc_data/users/heyangqin/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/include/THC -I/vc_data/users/heyangqin/anaconda3/envs/deepspeed/include -I/vc_data/users/heyangqin/anaconda3/envs/deepspeed/include/python3.10 -c -c /vc_data/users/heyangqin/apex/csrc/multi_tensor_sgd_kernel.cu -o /vc_data/users/heyangqin/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_sgd_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++14
This compile command line calls the nvcc in the conda env, and it does not include the system CUDA directory /usr/local/cuda/include/, where cusolverDn.h is located, which causes the error. So I manually updated PATH with export PATH=/usr/local/cuda/bin:$PATH and the error is gone. I wonder if this is the intended behavior?
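For anyone hitting the same thing, here is a rough sketch of how one could check which nvcc is first on PATH and whether cusolverDn.h sits next to it. The helper name cuda_has_cusolver is made up for illustration, and the check assumes the header lives in the include/ directory beside the bin/ directory that holds nvcc, which is the layout of /usr/local/cuda but may not hold for every conda package.

```shell
# Hypothetical helper: succeed if the toolkit that owns the given nvcc
# also has cusolverDn.h in its include/ directory.
cuda_has_cusolver() {
    nvcc_path="$1"
    # Assumption: nvcc lives in <root>/bin/nvcc, so <root> is two levels up.
    cuda_root="$(dirname "$(dirname "$nvcc_path")")"
    [ -f "$cuda_root/include/cusolverDn.h" ]
}

# Find the nvcc that a build started from this shell would pick up.
nvcc_path="$(command -v nvcc || true)"

if [ -z "$nvcc_path" ]; then
    echo "nvcc not found on PATH"
elif cuda_has_cusolver "$nvcc_path"; then
    echo "OK: cusolverDn.h found next to $nvcc_path"
else
    echo "cusolverDn.h not found next to $nvcc_path"
    echo "workaround used above: export PATH=/usr/local/cuda/bin:\$PATH"
fi
```

If the second message is printed inside the conda env, the build is probably using the conda nvcc rather than the system one, which matches what I saw.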