BackendCompilerFailed: _compile_fn raised RuntimeError: Triton requires CUDA 11.4+

I installed pytorch-nightly with the following command:

conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch-nightly -c nvidia

Then I tried the example from "torch.compile Tutorial — PyTorch Tutorials 1.13.0+cu117 documentation".

It finally threw this exception:

File /usr/local/conda/lib/python3.9/site-packages/torch/_dynamo/output_graph.py:591, in OutputGraph.call_user_compiler(self, gm)
    589 except Exception as e:
    590     compiled_fn = gm.forward
--> 591     raise BackendCompilerFailed(self.compiler_fn, e) from e
    592 return compiled_fn

BackendCompilerFailed: _compile_fn raised RuntimeError: Triton requires CUDA 11.4+

As far as I know, conda installed cuda-toolkit 11.6, so why does PyTorch throw this exception?

Any response will be appreciated.


Could you check the used CUDA runtime via print(torch.version.cuda)?

It’s 11.6, thanks for your response.

Could you post the code snippet raising this error, please, or are you directly executing the tutorial?

Yes, I’m executing the tutorial, the related code snippet is:

import torch

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)
    
    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule()
opt_mod = torch.compile(mod)

print(opt_mod(torch.randn(10, 100)))

def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000

def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )

N_ITERS = 10

from torchvision.models import resnet18
def init_model():
    return resnet18().to(torch.float32).cuda()

def evaluate(mod, inp):
    return mod(inp)

model = init_model()
evaluate_opt = torch.compile(evaluate, mode="reduce-overhead")

inp = generate_data(16)[0]
print("eager:", timed(lambda: evaluate(model, inp))[1])
print("compile:", timed(lambda: evaluate_opt(model, inp))[1])

Thanks for the code.
I cannot reproduce it with a current nightly conda binary using 11.6.
Env information:

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.14.0.dev20221208
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35)  [GCC 10.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-41-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 515.43.04
cuDNN version: Probably one of the following:
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn.so.8.5.0
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.5.0
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.5.0
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.5.0
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.5.0
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.5.0
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.14.0.dev20221208
[pip3] torchvision==0.15.0.dev20221208
[conda] blas                      2.116                       mkl    conda-forge
[conda] blas-devel                3.9.0            16_linux64_mkl    conda-forge
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.1.0           h84fe81f_915    conda-forge
[conda] mkl-devel                 2022.1.0           ha770c72_916    conda-forge
[conda] mkl-include               2022.1.0           h84fe81f_915    conda-forge
[conda] numpy                     1.23.5           py38h7042d01_0    conda-forge
[conda] pytorch                   1.14.0.dev20221208 py3.8_cuda11.6_cudnn8.3.2_0    pytorch-nightly
[conda] pytorch-cuda              11.6                 h867d48c_0    pytorch-nightly
[conda] pytorch-mutex             1.0                        cuda    pytorch-nightly
[conda] torchtriton               2.0.0+0d7e753227            py38    pytorch-nightly
[conda] torchvision               0.15.0.dev20221208      py38_cu116    pytorch-nightly
python -c "import torch; print(torch.__version__); print(torch.version.cuda)"
1.14.0.dev20221208
11.6

Output:

python main.py 
/opt/miniforge3/envs/nightly_conda_cuda116/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py:366: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.Consider setting `torch.set_float32_matmul_precision('high')`
  warnings.warn(
tensor([[2.6126e-02, 5.6818e-01, 0.0000e+00, 2.6177e-01, 4.0566e-01, 0.0000e+00,
         5.5200e-01, 0.0000e+00, 0.0000e+00, 2.0061e-01],
        [0.0000e+00, 1.4974e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00, 2.4196e-01,
         0.0000e+00, 2.4392e-02, 9.4952e-01, 0.0000e+00],
        [4.0854e-02, 7.2992e-01, 1.7494e-01, 0.0000e+00, 3.9046e-01, 0.0000e+00,
         0.0000e+00, 1.1856e+00, 5.1254e-01, 1.4365e+00],
        [0.0000e+00, 1.3372e+00, 0.0000e+00, 6.2340e-01, 0.0000e+00, 4.8263e-01,
         3.6486e-01, 0.0000e+00, 1.4925e-01, 4.0236e-01],
        [0.0000e+00, 0.0000e+00, 2.3011e-01, 2.8612e-01, 0.0000e+00, 2.9270e-01,
         0.0000e+00, 0.0000e+00, 5.5580e-01, 0.0000e+00],
        [1.1411e+00, 0.0000e+00, 5.2030e-01, 1.0582e+00, 5.4400e-04, 6.6906e-01,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 4.1712e-01],
        [2.8783e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00, 4.1535e-01, 0.0000e+00,
         0.0000e+00, 3.2157e-01, 2.4875e-01, 0.0000e+00],
        [8.6504e-01, 4.4471e-02, 5.2251e-01, 4.5288e-01, 0.0000e+00, 1.2464e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 5.8638e-01],
        [0.0000e+00, 7.3047e-01, 0.0000e+00, 0.0000e+00, 5.4288e-01, 9.1588e-01,
         0.0000e+00, 6.0390e-01, 2.5176e-01, 4.7328e-01],
        [0.0000e+00, 5.0342e-01, 1.2113e+00, 4.8887e-01, 0.0000e+00, 0.0000e+00,
         -0.0000e+00, -0.0000e+00, -0.0000e+00, 6.7853e-02]],
       grad_fn=<CompiledFunctionBackward>)
/opt/miniforge3/envs/nightly_conda_cuda116/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py:366: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.Consider setting `torch.set_float32_matmul_precision('high')`
  warnings.warn(
eager: 1.265844482421875
compile: 4.382791015625

I’m not 100% sure, but would guess the Triton backend might need to use your locally installed CUDA toolkit, which might be older. Could this be the case?

My PyTorch version is 1.14.0.dev20221207.

Yes, I also suspect that the Triton backend may be using the locally installed CUDA, which is 11.4.

So, should Triton use the locally installed CUDA? Thank you.

I don’t know how exactly Inductor is calling into Triton, so would need to check it by reading through the code. Will let you know once I’ve figured it out.

EDIT: Yes, it seems to use the locally installed nvcc as seen here.

Thanks, I have fixed this problem by upgrading the locally installed CUDA version to 11.6.

So, the cause is as you said, the Triton backend uses the CUDA installed locally instead of the CUDA of conda.
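To compare the two versions discussed above, a minimal sketch could print the CUDA runtime PyTorch was built with next to the release of the first `nvcc` found on `PATH`. The helper names `parse_nvcc_release` and `local_nvcc_release` are hypothetical, and the sketch assumes `nvcc --version` prints a "release X.Y" line like the output shown elsewhere in this thread:

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def local_nvcc_release():
    """Return the release of the first nvcc on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    try:
        import torch
        print("CUDA runtime PyTorch was built with:", torch.version.cuda)
    except ImportError:
        print("PyTorch is not installed in this environment")
    print("nvcc release found on PATH:", local_nvcc_release())
```

If the two values differ, the `nvcc` on `PATH` probably comes from a CUDA toolkit installed outside the conda environment.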


After a quick sync with @malfet he correctly pointed out that the check seems to fail in Triton directly here (not PyTorch as I’ve previously indicated).
Based on the code your locally installed CUDA 11.4 should work, so I’ll try to reproduce the issue.
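As an illustration of the "11.4+" requirement named in the error message, here is a small sketch of a minimum-version comparison. This is not the check Triton performs; `meets_minimum` is a hypothetical helper. It compares integer tuples rather than strings, so that a version such as "11.10" is not ordered before "11.4":

```python
def meets_minimum(version, minimum=(11, 4)):
    """Compare a 'major.minor[.patch]' string against a (major, minor) minimum."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


if __name__ == "__main__":
    # Versions mentioned in this thread.
    for candidate in ("11.1", "11.4", "11.6.124", "12.0"):
        print(candidate, meets_minimum(candidate))
```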

I still cannot reproduce the issue using a docker container with CUDA 11.4 and the latest PyTorch nightly release using the CUDA 11.6 runtime:

root@1a038fc7e498:/workspace/src# python -c "import torch; print(torch.__version__); print(torch.version.cuda)"
2.0.0.dev20221209
11.6
root@1a038fc7e498:/workspace/src# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:15:15_PDT_2021
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0
root@1a038fc7e498:/workspace/src# ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:14:30_PDT_2021
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0

Output:

root@1a038fc7e498:/workspace/src# python tmp.py 
/opt/conda/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py:366: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.Consider setting `torch.set_float32_matmul_precision('high')`
  warnings.warn(
tensor([[0.0000, 0.6264, 0.2043, 0.0000, 0.1676, 0.0000, 0.6097, 0.1066, 0.8841,
         0.0784],
        [0.0019, 0.0000, 0.0000, 0.0000, 0.3927, 0.0000, 0.0111, 0.2716, 0.6420,
         0.1461],
        [0.0000, 0.0267, 0.4337, 0.2366, 0.6798, 0.0000, 0.0865, 0.5707, 1.7802,
         0.0000],
        [0.0000, 0.0000, 0.0727, 0.0000, 0.0799, 0.0000, 0.0000, 0.0000, 0.0000,
         0.3500],
        [0.0000, 0.2702, 0.1387, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.7817],
        [0.0000, 0.5054, 0.3315, 1.0923, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000],
        [0.0000, 0.1780, 0.0540, 0.9178, 0.0000, 0.9666, 0.0000, 0.2082, 0.0000,
         1.1234],
        [0.0000, 0.0000, 0.2635, 0.2844, 0.2909, 0.2796, 0.0658, 0.0000, 0.0000,
         0.0000],
        [0.3576, 0.0000, 0.6487, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3771,
         0.1077],
        [0.1274, 0.8099, 0.0000, 0.9640, 0.0289, 0.4793, 0.9060, -0.0000, 0.1720,
         -0.0000]], grad_fn=<CompiledFunctionBackward>)
/opt/conda/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py:366: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.Consider setting `torch.set_float32_matmul_precision('high')`
  warnings.warn(
eager: 1.9353865966796875
compile: 6.1589453125

So your setup should also work, assuming your locally installed CUDA 11.4 is used.

Just confirming that I face the same issue.

I have CUDA 11.7 installed via conda, with torch==2.0.0.dev20221225+cu117, but CUDA 11.1 linked outside of conda (and indeed nvcc uses this version). This issue is resolved by installing CUDA 11.7 outside of conda.


I also have the same issue on GCP’s Vertex AI (in this case, with CUDA 11.6).

I faced this issue.

I resolved it after installing CUDA 12.0, here:

Looks like CUDA is now open source, so the process was considerably less painful on a cloud machine than it’s ever been.

I’m facing the same issue, but with CUDA 11.7. Unfortunately, I do not have the rights to upgrade the local CUDA installation outside conda. Has anyone found another solution to this issue?


Installing
conda install -c "nvidia/label/cuda-11.7.0" cuda-nvcc
seems to work for me. (It probably also works for all 11.x versions and 12.0.)
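After installing `cuda-nvcc` this way, it may be worth confirming that the `nvcc` found first on `PATH` is the one inside the active conda environment. The sketch below assumes the environment is activated, so that conda has set `CONDA_PREFIX`. The helper names are hypothetical:

```python
import os
import shutil


def is_inside_prefix(path, prefix):
    """True if `path` lies under the directory `prefix`."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def nvcc_from_conda_env():
    """Return (nvcc_path, from_env); from_env is None when it cannot be determined."""
    nvcc = shutil.which("nvcc")
    prefix = os.environ.get("CONDA_PREFIX")
    if nvcc is None or prefix is None:
        return nvcc, None
    return nvcc, is_inside_prefix(nvcc, prefix)


if __name__ == "__main__":
    path, from_env = nvcc_from_conda_env()
    print("nvcc on PATH:", path)
    print("inside active conda env:", from_env)
```

If this reports `False`, a system-wide `nvcc` is probably still taking precedence over the one installed by conda.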

@TheDudeJohan Did you try the above?


I am using Paperspace. I can’t use conda install -c "nvidia/label/cuda-11.7.0" cuda-nvcc to solve the problem.


The problem is that PyTorch and other Meta frameworks and libraries rely so heavily on open source that when an open-source library breaks, everything breaks.

Reproduced Issue in Colab:
(Google Colab)
Any suggestions?

@JonathanSum - I’m also facing the exact same issue on Colab.

I don’t think anyone will come here to help us out. We should let the Triton team know. Can you ping them so they can come here?