PyTorch 1.2.0 with compute capability 3.5

tholzmann · August 21, 2019, 9:27am

Hi,

I am using PyTorch 1.2.0 self-compiled with CUDA compute capability 5.2 with C++ and everything works as expected.

I read somewhere that everything down to compute capability 3.5 is supported. Since we aim to support as many graphics cards as possible, I tried to compile PyTorch with compute capability 3.5. However, I get an error when using this build:

.THCudaCheck FAIL file=D:/tools/pytorch-v1.2.0/aten/src\THC/generic/THCTensorMath.cu line=16 error=209 : no kernel image is available for execution on the device
exception message: cuda runtime error (209) : no kernel image is available for execution on the device at D:/tools/pytorch-v1.2.0/aten/src\THC/generic/THCTensorMath.cu:16
The above operation failed in interpreter, with the following stack trace:
at code/model-input_rgbip-output_14classes_best_train_2019_08_03_cpu-eval-mode-export_latest_pytorch.py:292:12
_135 = getattr(_131, "1")
_136 = _135.weight
_137 = _135.bias
_138 = getattr(self.decoder0, "0")
_139 = _138.weight
_140 = _138.bias
_141 = getattr(self.logit, "0")
_142 = _141.weight
_143 = _141.bias
input0 = torch._convolution(input, 1, None, [2, 2], [3, 3], [1, 1], False, [0, 0], 1, True, False, True)
~~~~~~~~~~~~~~~~~~ <--- HERE
input1 = torch.batch_norm(input0, weight, bias, running_mean, running_var, False, 0.10000000000000001, 1.0000000000000001e-05, True)
input2 = torch.relu(input1)
input3 = torch.max_pool2d(input2, [3, 3], [2, 2], [1, 1], [1, 1], False)
input4 = torch._convolution(input3, 5, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, True, False, True)
input5 = torch.batch_norm(input4, weight0, bias0, running_mean0, running_var0, False, 0.10000000000000001, 1.0000000000000001e-05, True)
input6 = torch.relu(input5)
input7 = torch._convolution(input6, 7, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, True, False, True)
out = torch.batch_norm(input7, weight1, bias1, running_mean1, running_var1, False, 0.10000000000000001, 1.0000000000000001e-05, True)
input8 = torch.add(out, input3, alpha=1)
Compiled from code /opt/conda/lib/python3.6/site-packages/torch/nn/modules/conv.py(340): forward
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(523): _slow_forward
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(537): __call__
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py(92): forward
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(523): _slow_forward
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(537): __call__
…/dl/models/unet.py(153): forward
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(523): _slow_forward
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(537): __call__
/opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(883): trace_module
/opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(751): trace
(5):
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py(3296): run_code
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py(3214): run_ast_nodes
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py(3049): run_cell_async
/opt/conda/lib/python3.6/site-packages/IPython/core/async_helpers.py(67): _pseudo_sync_runner
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py(2874): _run_cell
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py(2848): run_cell
/opt/conda/lib/python3.6/site-packages/ipykernel/zmqshell.py(536): run_cell
/opt/conda/lib/python3.6/site-packages/ipykernel/ipkernel.py(294): do_execute
/opt/conda/lib/python3.6/site-packages/tornado/gen.py(209): wrapper
/opt/conda/lib/python3.6/site-packages/ipykernel/kernelbase.py(534): execute_request
/opt/conda/lib/python3.6/site-packages/tornado/gen.py(209): wrapper
/opt/conda/lib/python3.6/site-packages/ipykernel/kernelbase.py(267): dispatch_shell
/opt/conda/lib/python3.6/site-packages/tornado/gen.py(209): wrapper
/opt/conda/lib/python3.6/site-packages/ipykernel/kernelbase.py(357): process_one
/opt/conda/lib/python3.6/site-packages/tornado/gen.py(742): run
/opt/conda/lib/python3.6/site-packages/tornado/gen.py(781): inner
/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py(743): _run_callback
/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py(690):
/opt/conda/lib/python3.6/asyncio/events.py(145): _run
/opt/conda/lib/python3.6/asyncio/base_events.py(1451): _run_once
/opt/conda/lib/python3.6/asyncio/base_events.py(438): run_forever
/opt/conda/lib/python3.6/site-packages/tornado/platform/asyncio.py(148): start
/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py(505): start
/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py(658): launch_instance
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py(16):
/opt/conda/lib/python3.6/runpy.py(85): _run_code
/opt/conda/lib/python3.6/runpy.py(193): _run_module_as_main

So I assume compute capability 3.5 is no longer fully supported?
Down to which compute capability should PyTorch 1.2.0 work correctly?

My system:
Windows 10
Visual Studio 2019 - CUDA 10.1
Python 3.7
Self-Compiled PyTorch 1.2.0

Thanks!

Best,
Thomas

tholzmann · August 21, 2019, 2:40pm

I tried to build with compute capability 5.0 now, and this build is working. So I assume PyTorch requires 5.0 as minimum compute capability. Is this correct?
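Before picking a compute capability to build for, it can help to confirm what the target card actually reports. The following is a minimal sketch, assuming a working PyTorch install with CUDA on that machine. `format_capability` and `list_device_capabilities` are hypothetical helper names chosen for illustration; the `torch.cuda` calls are the ones doing the real work.

```python
def format_capability(cap):
    """Turn a (major, minor) tuple such as (3, 5) into the string '3.5'."""
    major, minor = cap
    return f"{major}.{minor}"


def list_device_capabilities():
    """Return [(device_name, 'major.minor'), ...] for each visible CUDA device.

    Returns an empty list if torch cannot be imported or CUDA is unavailable.
    """
    try:
        import torch  # imported lazily so the helper above works without it
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    devices = []
    for index in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(index)
        cap = torch.cuda.get_device_capability(index)  # (major, minor) tuple
        devices.append((name, format_capability(cap)))
    return devices
```

Calling `list_device_capabilities()` on the target machine and comparing the result with the arch list used for the build shows whether the card's capability was one of the build targets.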

albanD · August 21, 2019, 7:03pm

Hi,

The comments about PyTorch working all the way down to cc 3.5 are quite old. I’m afraid this is not true anymore.

peterjc123 · August 24, 2019, 6:33am

github.com

pytorch/builder/blob/master/conda/pytorch-1.1.0/bld.bat#L18


    set build_with_cuda=
) else (
    set build_with_cuda=1
    set desired_cuda=%CUDA_VERSION:~0,-1%.%CUDA_VERSION:~-1,1%
)


if "%build_with_cuda%" == "" goto cuda_flags_end


set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%desired_cuda%
set CUDA_BIN_PATH=%CUDA_PATH%\bin
set TORCH_CUDA_ARCH_LIST=3.5;5.0+PTX
if "%desired_cuda%" == "8.0" set TORCH_CUDA_ARCH_LIST=%TORCH_CUDA_ARCH_LIST%;6.0;6.1
if "%desired_cuda%" == "9.0" set TORCH_CUDA_ARCH_LIST=%TORCH_CUDA_ARCH_LIST%;6.0;7.0
if "%desired_cuda%" == "9.2" set TORCH_CUDA_ARCH_LIST=%TORCH_CUDA_ARCH_LIST%;6.0;6.1;7.0
if "%desired_cuda%" == "10.0" set TORCH_CUDA_ARCH_LIST=%TORCH_CUDA_ARCH_LIST%;6.0;6.1;7.0;7.5
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all


:cuda_flags_end


set DISTUTILS_USE_SDK=1

github.com

pytorch/builder/blob/master/windows/cuda100.bat#L36


    echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
    exit /b 1
    goto optcheck
)


IF "%CUDA_PATH_V10_0%"=="" (
    echo CUDA 10.0 not found, failing
    exit /b 1
) ELSE (
    IF "%BUILD_VISION%" == "" (
        set TORCH_CUDA_ARCH_LIST=3.5;5.0+PTX;6.0;6.1;7.0;7.5
        set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
    ) ELSE (
        set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_50,code=compute_50
    )


    set "CUDA_PATH=%CUDA_PATH_V10_0%"
    set "PATH=%CUDA_PATH_V10_0%\bin;%PATH%"
)


:optcheck

github.com

pytorch/builder/blob/master/windows/cuda90.bat#L36


    echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
    exit /b 1
    goto optcheck
)


IF "%CUDA_PATH_V9_0%"=="" (
    echo CUDA 9 not found, failing
    exit /b 1
) ELSE (
    IF "%BUILD_VISION%" == "" (
        set TORCH_CUDA_ARCH_LIST=3.5;5.0+PTX;6.0;7.0
        set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
    ) ELSE (
        set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_50,code=compute_50
    )


    set "CUDA_PATH=%CUDA_PATH_V9_0%"
    set "PATH=%CUDA_PATH_V9_0%\bin;%PATH%"
)


:optcheck

Actually, 3.5 is in TORCH_CUDA_ARCH_LIST in the official build scripts, so I wonder why it would not be supported.
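For reference, each entry in the arch list corresponds to nvcc `-gencode` flags like the ones spelled out in the BUILD_VISION branch of the scripts quoted above: a plain entry such as 3.5 requests machine code for sm_35, and a `+PTX` suffix additionally embeds PTX for that architecture. The sketch below only illustrates that convention. `gencode_flags` is a hypothetical helper, not PyTorch's actual build code, and it assumes purely numeric entries separated by semicolons or spaces.

```python
def gencode_flags(arch_list):
    """Expand an arch list such as '3.5;5.0+PTX' into nvcc -gencode flags.

    Assumption: entries are numeric 'major.minor' values, optionally
    followed by '+PTX'. Named architectures are not handled here.
    """
    flags = []
    for entry in arch_list.replace(" ", ";").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        with_ptx = entry.endswith("+PTX")
        number = entry[: -len("+PTX")] if with_ptx else entry
        major, minor = number.split(".")
        cc = major + minor  # '3.5' becomes '35'
        # Machine code (SASS) for this exact architecture.
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
        if with_ptx:
            # Also embed PTX so the driver can JIT-compile for newer devices.
            flags.append(f"-gencode=arch=compute_{cc},code=compute_{cc}")
    return flags


print(" ".join(gencode_flags("3.5;5.0+PTX")))
```

For `3.5;5.0+PTX` this produces an sm_35 entry, an sm_50 entry, and a trailing `compute_50` PTX entry, which is the same pattern as the end of the hand-written NVCC_FLAGS line in the scripts.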

tholzmann · August 28, 2019, 1:52pm

Interestingly, when I use the prebuilt Windows version built with CUDA/cuDNN, it works on a graphics card with cc 3.7 (Tesla K80). However, when I build PyTorch myself with CUDA but without cuDNN, it does not work on this graphics card.

Is it possible that some instructions are only implemented for a higher cc in CUDA, but if cuDNN is used these instructions are implemented with cuDNN and hence it works?
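One thing that may be worth ruling out first is whether the self-built binary contains code the K80 can load at all. CUDA's documented compatibility rules are that machine code built for sm_XY runs on devices with the same major version and an equal or higher minor version, while embedded PTX for compute_XY can be JIT-compiled on any device with equal or higher capability. The sketch below applies just those two rules to an arch list. `parse_arch_list` and `can_run` are hypothetical helpers, and the check says nothing about whether individual kernels or cuDNN paths behave differently.

```python
def parse_arch_list(arch_list):
    """Split an arch list into (sass, ptx) lists of (major, minor) tuples.

    Assumption: numeric 'major.minor' entries, optional '+PTX' suffix,
    separated by semicolons or spaces.
    """
    sass, ptx = [], []
    for entry in arch_list.replace(" ", ";").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        with_ptx = entry.endswith("+PTX")
        number = entry[: -len("+PTX")] if with_ptx else entry
        major, minor = (int(part) for part in number.split("."))
        sass.append((major, minor))
        if with_ptx:
            ptx.append((major, minor))
    return sass, ptx


def can_run(device_cc, arch_list):
    """Return 'sass', 'ptx', or None for a device given as (major, minor)."""
    sass, ptx = parse_arch_list(arch_list)
    dev_major, dev_minor = device_cc
    # Binary compatibility: same major version, build minor <= device minor.
    for major, minor in sass:
        if major == dev_major and minor <= dev_minor:
            return "sass"
    # PTX compatibility: any embedded PTX at or below the device capability.
    for cc in ptx:
        if cc <= device_cc:
            return "ptx"
    return None


official = "3.5;5.0+PTX;6.0;6.1;7.0;7.5"
print(can_run((3, 7), official))  # K80 against the CUDA 10.0 list quoted above
print(can_run((5, 2), "3.5"))     # a 3.5-only build on a cc 5.2 card
```

Under these rules the K80 is covered by the sm_35 code in the official list, whereas a build targeting only 3.5 without PTX has nothing a cc 5.2 card can load, which would surface as "no kernel image is available". Whether either case matches the builds discussed here depends on the exact arch list and test card used, which the thread does not state.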