RuntimeError: Internal Triton PTX codegen error, PTX .version 7.4 does not support .target sm_89

I’m trying to run train.py from https://github.com/karpathy/nanoGPT on an RTX 4090 / Ada card. torch.compile(model) fails with the following error:

torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised RuntimeError: Internal Triton PTX codegen error: 
ptxas /tmp/fileelTjxJ, line 6; error   : PTX .version 7.4 does not support .target sm_89
ptxas fatal   : Ptx assembly aborted due to errors

My environment:

Collecting environment information...
PyTorch version: 2.0.0.dev20230119
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.10 (x86_64)
GCC version: (Ubuntu 12.2.0-3ubuntu1) 12.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.36

Python version: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.19.0-29-generic-x86_64-with-glibc2.36
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA Graphics Device
Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8.7.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.7.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.7.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.7.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.7.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.7.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==2.0.0.dev20230119
[pip3] torchaudio==2.0.0.dev20230119
[pip3] torchvision==0.15.0.dev20230119
[conda] blas                      1.0                         mkl  
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0           py310h7f8727e_0  
[conda] mkl_fft                   1.3.1           py310hd6ae3a3_0  
[conda] mkl_random                1.2.2           py310h00e6091_0  
[conda] numpy                     1.23.5          py310hd5efca6_0  
[conda] numpy-base                1.23.5          py310h8e6c178_0  
[conda] pytorch                   2.0.0.dev20230119 py3.10_cuda11.8_cudnn8.5.0_0    pytorch-nightly
[conda] pytorch-cuda              11.8                 h8dd9ede_2    pytorch-nightly
[conda] pytorch-mutex             1.0                        cuda    pytorch-nightly
[conda] torchaudio                2.0.0.dev20230119     py310_cu118    pytorch-nightly
[conda] torchtriton               2.0.0+0d7e753227           py310    pytorch-nightly
[conda] torchvision               0.15.0.dev20230119     py310_cu118    pytorch-nightly

ptxas --version returns

Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

I’m not sure where PTX v7.4 comes into play here, or what to update / re-compile to enable sm_89 / Ada. If I understand correctly, everything in the environment should be CUDA 11.8, which ships PTX ISA 7.8 and supports sm_89.
Any recommendations?
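For reference, a quick way to confirm the card reports sm_89 and to check which ptxas Triton bundles (the bundled path below is a guess on my part and may differ between builds):

import os
import subprocess

import torch
import triton

print(torch.__version__, torch.version.cuda)   # 2.0.0.dev20230119, 11.8
print(torch.cuda.get_device_capability(0))     # expect (8, 9) on an RTX 4090

# Triton ships its own ptxas, so the system CUDA 11.8 ptxas (PTX ISA 7.8) is not
# necessarily the one being invoked during codegen.
bundled_ptxas = os.path.join(
    os.path.dirname(triton.__file__), "third_party", "cuda", "bin", "ptxas"
)
if os.path.exists(bundled_ptxas):
    result = subprocess.run([bundled_ptxas, "--version"], capture_output=True, text=True)
    print(result.stdout)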

Full stack trace:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 533, in _worker_compile
    kernel.precompile(warm_cache_only_with_cc=cc)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 59, in precompile
    self.launchers = [
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 60, in <listcomp>
    self._precompile_config(c, warm_cache_only_with_cc)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 73, in _precompile_config
    triton.compile(
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/triton/compiler.py", line 1256, in compile
    asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/triton/compiler.py", line 901, in _compile
    name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Internal Triton PTX codegen error: 
ptxas /tmp/fileSePVLQ, line 6; error   : PTX .version 7.4 does not support .target sm_89
ptxas fatal   : Ptx assembly aborted due to errors

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 674, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.fake_example_inputs())
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 1047, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/__init__.py", line 1264, in __call__
    return self.compile_fn(model_, inputs_)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/optimizations/backends.py", line 24, in inner
    return fn(gm, example_inputs, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/optimizations/backends.py", line 61, in inductor
    return compile_fx(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 411, in compile_fx
    return aot_autograd(
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/optimizations/training.py", line 78, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2453, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 96, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2150, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1412, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1062, in aot_dispatch_base
    compiled_fw = aot_config.fw_compiler(fw_module, flat_args)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 96, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 386, in fw_compiler
    return inner_compile(
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 586, in debug_wrapper
    compiled_fn = compiler_fn(gm, example_inputs, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/debug.py", line 224, in inner
    return fn(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 153, in compile_fx_inner
    compiled_fn = graph.compile_to_fn()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/graph.py", line 545, in compile_to_fn
    return self.compile_to_module().call
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 96, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/graph.py", line 534, in compile_to_module
    mod = PyCodeCache.load(code)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 504, in load
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_jrahn/3c/c3cmse7l372boit76z5ugnr2v7pxwsof5xkbmih2v5f77zwl2n4e.py", line 1095, in <module>
    async_compile.wait(globals())
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 699, in wait
    scope[key] = result.result()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 557, in result
    self.future.result()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
RuntimeError: Internal Triton PTX codegen error: 
ptxas /tmp/fileSePVLQ, line 6; error   : PTX .version 7.4 does not support .target sm_89
ptxas fatal   : Ptx assembly aborted due to errors


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jrahn/dev/nanoGPT/train.py", line 223, in <module>
    losses = estimate_loss()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jrahn/dev/nanoGPT/train.py", line 184, in estimate_loss
    logits, loss = model(X, Y)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1488, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 82, in forward
    return self.dynamo_ctx(self._orig_mod.forward)(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 211, in _fn
    return fn(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 332, in catch_errors
    return callback(frame, cache_size, hooks)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 480, in _convert_frame
    result = inner_convert(frame, cache_size, hooks)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 103, in _fn
    return fn(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 96, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 339, in _convert_frame_assert
    return _compile(
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 400, in _compile
    out_code = transform_code_object(code, transform)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 341, in transform_code_object
    transformations(instructions, code_options)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 387, in transform
    tracer.run()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1684, in run
    super().run()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 538, in run
    and self.step()
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 501, in step
    getattr(self, inst.opname)(inst)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1750, in RETURN_VALUE
    self.output.compile_subgraph(self)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 551, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 598, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/jrahn/miniconda3/envs/pt2/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 679, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised RuntimeError: Internal Triton PTX codegen error: 
ptxas /tmp/fileSePVLQ, line 6; error   : PTX .version 7.4 does not support .target sm_89
ptxas fatal   : Ptx assembly aborted due to errors


Set torch._dynamo.config.verbose=True for more information


You can suppress this exception and fall back to eager by setting:
    torch._dynamo.config.suppress_errors = True

The error is raised by OpenAI/Triton and is related to this issue.
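In the meantime you can fall back to eager, as the message at the end of the traceback suggests — a minimal sketch (this only hides the compile failure, it doesn’t fix it):

import torch
import torch._dynamo

# Fall back to eager execution when the Inductor/Triton backend fails to compile,
# instead of raising BackendCompilerFailed (see the hint in the traceback above).
torch._dynamo.config.suppress_errors = True

model = torch.compile(torch.nn.Linear(64, 64).cuda())
out = model(torch.randn(8, 64, device="cuda"))  # runs eagerly if Triton codegen fails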

Thanks for the fast reply! So updating and compiling Triton should be sufficient, or is there anything on the PyTorch side (torchtriton?) that also needs to be changed?

Yes, I think pytorch-triton would need to be rebuilt. I’m unsure, but I would assume that .github/ci_commit_pins/triton.txt points to the OpenAI/Triton commit used for the pytorch-triton build, which was last updated on Nov 6th, 2022.
@malfet can correct me if I’m missing something.
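As a sanity check, you can confirm from Python which Triton build torch.compile is actually picking up (the conda listing above already shows the pinned commit in the package version, 2.0.0+0d7e753227):

import torch
import triton

print(torch.__version__)    # 2.0.0.dev20230119
print(triton.__version__)   # may or may not carry the pinned commit suffix
print(triton.__file__)      # which installed Triton torch.compile will import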


Oh good lord, is it really happening? This has caused no small rift between my 4090 and 3090. Even my 3080 was starting to laugh at my poor 4090. At first I thought, oh LLVM, what are you going to do? And then I saw up to sm_90 in their master. I just want that sweet Inductor. I tried a few other backends and they all led to even harder-to-decipher pain.

But how do we actually fix this? The idea that the latest PyTorch 2.0 nightly builds bundle a Triton ptxas that doesn’t support sm_89 is absurd. The ptxas shipped with CUDA 11.8 and with CUDA 12 both support sm_89.

It is so frustrating that NVIDIA started selling the 4090 last year, yet even the latest, cutting-edge, not-yet-GA software (torch 2.0 / Triton 2.0) can’t handle it. What is going on?

I think I recall someone saying sm_89 wasn’t added to the list of supported archs because there’d be no perf advantage to Ada-specific (sm_89) code generation. Well, tell me something: isn’t being “functional” and generating ANY it/s whatsoever a perf advantage over just crashing?

Now that PyTorch bundles Triton, do we still need to install Triton separately?

You could check the workaround posted here.

This issue is not caused by CUDA or NVIDIA not supporting the 4090, but by a broken binary packaging process as described here.

That is correct for nvcc. However, ptxas needs to be packaged in a newer version to support the latest GPUs. Again, this is a packaging issue by OpenAI/Triton.
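One workaround that has been floated for this kind of packaging problem is to swap the ptxas Triton bundles for the system CUDA 11.8 one, which does support sm_89. A rough sketch — the paths are assumptions, adjust them to your install, and overwriting files inside site-packages is very much a hack:

import os
import shutil
import subprocess

import triton

# Assumed location of the bundled ptxas; may differ between Triton builds.
bundled = os.path.join(os.path.dirname(triton.__file__), "third_party", "cuda", "bin", "ptxas")
system = "/usr/local/cuda-11.8/bin/ptxas"   # matches the CUDA 11.8.89 install above

if os.path.exists(bundled) and os.path.exists(system):
    shutil.copy2(bundled, bundled + ".bak")   # keep the original around
    shutil.copy2(system, bundled)
    print(subprocess.run([bundled, "--version"], capture_output=True, text=True).stdout)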

Thanks for the clarity that provides. Installing that a2 Triton does get me past that bug, but then I get:

[2023-02-20 18:25:02,342] torch._inductor.graph: [ERROR] Error from lowering
Traceback (most recent call last):
   File "<string>", line 21, in triton_mm
   KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-d6252949da17ceb5f3a278a70250af13-b891df919229ded8c34de46c49ad232c-4db97f5ec7972243d551c02196777493-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, torch.float16), (), (True, True, True, True))
During handling of the above exception, another exception occurred:

with a number of instances of the following:
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered

I wonder if that so-called “KEY” is some memory overwrite. I’m the GOAT for debugging memory overwrites and corruption, but not inside the GPU. :)

This might be another valid OpenAI/Triton issue.
You could either use cuda-gdb to narrow down the memory violation or forward me a minimal and executable code snippet for further debugging.
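For the former, it usually helps to make the failure synchronous first so the Python stack trace points at the offending launch; compute-sanitizer or cuda-gdb can then pinpoint it inside the kernel:

import os

# Must be set before CUDA is initialized, so put it at the very top of the
# script (e.g. train.py) before importing anything that touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
# ... run the failing torch.compile workload here; the illegal memory access
# should now surface at the offending kernel launch rather than at a later call.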


The very first error appears to be in graph.py:call_function():

  out = lowerings[target](*args, **kwargs)

and this function is:

  lowerings[target] = <function tuned_addmm at 0x7f886d047910>

I’m not sure when I’ll hit the brick wall of having to descend into CUDA itself, but I’ll keep instrumenting Python code I don’t understand to narrow it down. Maybe from this I can create a simple test case (see the sketch below).
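Something like this is the shape of repro I have in mind — a standalone fp16 addmm (nn.Linear) pushed through torch.compile, since the failing lowering above is tuned_addmm. Whether this hits exactly the same autotuned triton_mm path is an assumption on my part:

import torch

# Minimal sketch: fp16 linear layer compiled with the default Inductor backend.
model = torch.nn.Linear(512, 512).cuda().half()
model = torch.compile(model)

x = torch.randn(64, 512, device="cuda", dtype=torch.float16)
y = model(x)    # may or may not exercise the same tuned_addmm / triton_mm kernels
print(y.shape)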

@ptrblck

The corruption appears to be coming from an auto-generated benchmarking Python file. The error occurs when the function returns, although I suspect the last line of the function.

def triton_mm(in_ptr0, arg_A, arg_B, out_ptr0):
    # <all but the last line removed>
    tl.store(out_ptr0 + (xindex + tl.zeros(mask.shape, tl.int32)), tmp1, mask)

I’m trying to print out the values, but I can’t even figure out how. When I print the 5 values I get:

out_ptr0   pointer<fp16>[constexpr[1]]
xindex     int32[constexpr[32],constexpr[32]]
mask.shape [constexpr[32], constexpr[32]]
tmp1       fp32[constexpr[32],constexpr[32]]
mask       int1[constexpr[32],constexpr[32]]

But if out_ptr0 is an array, I get an error with:

print(out_ptr0[0])
              ^

I want to know if an array-bounds violation is happening. How do I print the array and see its contents? Perhaps xindex has too large a value.

Since it is just a benchmark, I commented out the tl.store(…) line and it got a lot farther, running other benchmarks until some other file hit the same problem. Hopefully tomorrow I’ll learn how to inspect the bounds of the vectors/arrays/matrices/tensors, or whatever you call them.
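One way I’ve seen to inspect device-side values without printing from inside the kernel is to copy them into an extra debug tensor and read it back on the host. A minimal standalone sketch of that idea (my own toy kernel, not the Inductor-generated one):

import torch
import triton
import triton.language as tl

@triton.jit
def debug_indices_kernel(out_ptr, dbg_ptr, numel, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < numel
    # Copy the index each lane would store to into a separate debug buffer,
    # so out-of-range indices can be spotted from the host side.
    tl.store(dbg_ptr + offs, offs, mask=mask)
    tl.store(out_ptr + offs, tl.zeros([BLOCK], tl.float16), mask=mask)

n = 1024
out = torch.empty(n, device="cuda", dtype=torch.float16)
dbg = torch.full((n,), -1, device="cuda", dtype=torch.int32)
debug_indices_kernel[(triton.cdiv(n, 256),)](out, dbg, n, BLOCK=256)
print(dbg.max().item(), out.numel())   # max index actually written vs. buffer size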