vgoklani
(Vishal Goklani)
January 24, 2025, 2:41pm
We’re getting the Docker image from nvcr.io/nvidia/pytorch:24.12-py3, and when we get the arch list:
import torch
torch.cuda.get_arch_list()
# ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'compute_90']
torch.version.cuda
# '12.6'
sm_89 is not listed. We are running 4x NVIDIA RTX 6000 Ada cards.
We also tried installing the latest version of torch via pip, but still don’t see sm_89 listed. Do we have to do something special to enable it?
Here is the CUDA information if needed:
root@~ $ nvidia-smi --version
NVIDIA-SMI version : 565.57.01
NVML version : 565.57
DRIVER version : 565.57.01
CUDA Version : 12.7
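For reference, a minimal check (assuming the RTX 6000 Ada setup above) that compares what the device itself reports with the architectures the wheel was built for:
import torch
# Compute capability of the physical card: (8, 9), i.e. sm_89 on an RTX 6000 Ada.
print(torch.cuda.get_device_capability(0))
# Architectures the installed binaries were compiled for; sm_89 is not among them.
print(torch.cuda.get_arch_list())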
Thanks!
No, you don’t need to build for sm_89, as it’s binary compatible with sm_86/sm_80. Your device is thus supported in all of our builds.
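As an illustration of that binary compatibility (a hedged sketch, assuming any sm_89 device with a stock sm_80/sm_86 build), PyTorch’s own CUDA kernels still dispatch and run:
import torch
# The wheel ships cubins for sm_80/sm_86 (plus PTX for compute_90); cubins are
# forward binary compatible within the same major architecture, so they run on
# a compute capability 8.9 device even though sm_89 is not in the arch list.
x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
y = torch.relu(x) + x.sin()   # native CUDA kernels compiled into the wheel
print(y.shape)                # torch.Size([1024, 1024]), no "no kernel image" error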
vgoklani
(Vishal Goklani)
January 24, 2025, 3:44pm
Thanks @ptrblck, so there shouldn’t be any issues running FP8-specific kernels in PyTorch? Those aren’t part of sm_86…
More specifically, I’m unable to utilize row-wise scaling in FP8:
GitHub issue (opened 11 Oct 24 UTC, closed 22 Jan 25 UTC; labels: question, float8):
I am running torchao: 0.5 and torch: '2.5.0a0+b465a5843b.nv24.09' on an NVIDIA A… 6000 ADA card (sm89) which supports FP8.
I ran the generate.py code from the benchmark:
python generate.py --checkpoint_path $CHECKPOINT_PATH --compile --compile_prefill --write_result /root/benchmark_results__baseline.txt
> Average tokens/sec: 57.01
> Average Bandwidth: 855.74 GB/s
> Peak Memory Usage: 16.19 GB
> Model Size: 15.01 GB
> 20241011143042, tok/s= 57.01, mem/s= 855.74 GB/s, peak_mem=16.19 GB, model_size=15.01 GB quant: None, mod: Meta-Llama-3-8B, kv_quant: False, compile: True, compile_prefill: True, dtype: torch.bfloat16, device: cuda repro: python generate.py --checkpoint_path /models/Meta-Llama-3-8B/consolidated.00.pth --device cuda --precision torch.bfloat16 --compile --compile_prefill --num_samples 5 --max_new_tokens 200 --top_k 200 --temperature 0.8
python generate.py --checkpoint_path $CHECKPOINT_PATH --compile --compile_prefill --quantization float8wo --write_result /root/benchmark_results__float8wo.txt
> Average tokens/sec: 57.00
> Average Bandwidth: 855.62 GB/s
> Peak Memory Usage: 16.19 GB
> Model Size: 15.01 GB
> 20241011143316, tok/s= 57.00, mem/s= 855.62 GB/s, peak_mem=16.19 GB, model_size=15.01 GB quant: float8wo, mod: Meta-Llama-3-8B, kv_quant: False, compile: True, compile_prefill: True, dtype: torch.bfloat16, device: cuda repro: python generate.py --quantization float8wo --checkpoint_path /models/Meta-Llama-3-8B/consolidated.00.pth --device cuda --precision torch.bfloat16 --compile --compile_prefill --num_samples 5 --max_new_tokens 200 --top_k 200 --temperature 0.8
The `float8wo` flag does not appear to be doing anything. Am I missing a step? Thanks!
The response from the thread was that sm_89 should be listed in the arch_list.
Thanks!
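For context, here is roughly what a row-wise-scaled FP8 matmul looks like through the private torch._scaled_mm op; this is only a sketch assuming a PyTorch 2.5-era signature, and on a build whose row-wise kernel was not compiled for the device it is expected to raise rather than silently fall back:
import torch
M, K, N = 128, 256, 64  # _scaled_mm expects dimensions that are multiples of 16
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")
fp8 = torch.float8_e4m3fn
fp8_max = torch.finfo(fp8).max
# Row-wise scaling: one scale per row of A (shape M x 1) and per column of B (1 x N).
scale_a = a.abs().amax(dim=1, keepdim=True) / fp8_max
scale_b = b.abs().amax(dim=0, keepdim=True) / fp8_max
a_fp8 = (a / scale_a).to(fp8)                       # A stays row-major
b_fp8 = (b / scale_b).to(fp8).t().contiguous().t()  # B must be column-major
# Dispatches the CUTLASS row-wise kernel when it was built for this device;
# otherwise it raises, which is the failure mode discussed in this thread.
out = torch._scaled_mm(a_fp8, b_fp8, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([128, 64])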
I don’t know if the current row-wise scaling kernel implementation is compatible with sm_89, and we are explicitly building them for sm_90a (arch-conditional) here.
vgoklani
(Vishal Goklani)
January 24, 2025, 8:21pm
There was a recent PR - Add SM89 support for f8f8bf16_rowwise() by alexsamardzic · Pull Request #144348 · pytorch/pytorch · GitHub - which “introduced support for _scaled_mm operator with FP8 inputs on SM89 architecture. The support is based on CUTLASS library, that is header-only C++ library, so this new functionality gets fully built along with PyTorch build; however, it will get built only in case the build includes SM89 among targets.”
Unfortunately this requires that sm_89 is in the list of build targets.
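Until a build that lists sm_89 is available, a hedged runtime guard (a hypothetical helper, approximating the conditions discussed above rather than an authoritative capability check) could look like:
import torch
def rowwise_fp8_available() -> bool:
    # Hypothetical helper: per the discussion above, the CUTLASS row-wise kernel
    # only exists when the build targets include the device's own arch
    # (sm_90a today, sm_89 once PR #144348 lands in a build that targets 8.9).
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9) and f"sm_{major}{minor}" in torch.cuda.get_arch_list()
print(rowwise_fp8_available())  # False on the 24.12 container from this thread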
I just opened a ticket: Support `sm_89` in Stable/Nightly/Docker Images · Issue #145632 · pytorch/pytorch · GitHub
Does this make sense?
No, we should not add sm_89 directly, as it will waste space with no benefit besides FP8 support. Instead, we should add sm_89 only to the one file supporting FP8, as explained in my comment on GitHub.