vgoklani
(Vishal Goklani)
January 24, 2025, 2:41pm
We’re getting the Docker image from nvcr.io/nvidia/pytorch:24.12-py3, and when we get the arch list:
import torch
torch.cuda.get_arch_list()
# ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'compute_90']
torch.version.cuda
# '12.6'
sm_89 is not listed. We are running 4x NVIDIA RTX 6000 Ada cards.
We also tried installing the latest version of torch via pip, but still don’t see sm_89 listed. Do we have to do something special to enable it?
Here is the CUDA information if needed:
root@~ $ nvidia-smi --version
NVIDIA-SMI version : 565.57.01
NVML version : 565.57
DRIVER version : 565.57.01
CUDA Version : 12.7
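For reference, a minimal check (assuming the RTX 6000 Ada setup above) that compares what the device itself reports with the architectures the wheel was built for:
import torch
# Compute capability of the physical card: (8, 9), i.e. sm_89 on an RTX 6000 Ada.
print(torch.cuda.get_device_capability(0))
# Architectures the installed binaries were compiled for; sm_89 is not among them.
print(torch.cuda.get_arch_list())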
Thanks!
No, you don’t need to build for sm_89, as it’s binary compatible with sm_86/sm_80. Your device is thus supported in all of our builds.
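As an illustration of that binary compatibility (a hedged sketch, assuming any sm_89 device with a stock sm_80/sm_86 build), PyTorch’s own CUDA kernels still dispatch and run:
import torch
# The wheel ships cubins for sm_80/sm_86 (plus PTX for compute_90); cubins are
# forward binary compatible within the same major architecture, so they run on
# a compute capability 8.9 device even though sm_89 is not in the arch list.
x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
y = torch.relu(x) + x.sin()   # native CUDA kernels compiled into the wheel
print(y.shape)                # torch.Size([1024, 1024]), no "no kernel image" error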
vgoklani
(Vishal Goklani)
January 24, 2025, 3:44pm
Thanks @ptrblck, so there shouldn’t be any issues running FP8-specific kernels in PyTorch? Those aren’t part of sm_86…
More specifically, I’m unable to utilize row-wise scaling in FP8:
GitHub issue (opened 11 Oct 24 UTC, closed 22 Jan 25 UTC; labels: question, float8):
I am running torchao: 0.5 and torch: '2.5.0a0+b465a5843b.nv24.09' on an NVIDIA A… 6000 ADA card (sm89) which supports FP8.
I ran the generate.py code from the benchmark:
python generate.py --checkpoint_path $CHECKPOINT_PATH --compile --compile_prefill --write_result /root/benchmark_results__baseline.txt
> Average tokens/sec: 57.01
> Average Bandwidth: 855.74 GB/s
> Peak Memory Usage: 16.19 GB
> Model Size: 15.01 GB
> 20241011143042, tok/s= 57.01, mem/s= 855.74 GB/s, peak_mem=16.19 GB, model_size=15.01 GB quant: None, mod: Meta-Llama-3-8B, kv_quant: False, compile: True, compile_prefill: True, dtype: torch.bfloat16, device: cuda repro: python generate.py --checkpoint_path /models/Meta-Llama-3-8B/consolidated.00.pth --device cuda --precision torch.bfloat16 --compile --compile_prefill --num_samples 5 --max_new_tokens 200 --top_k 200 --temperature 0.8
python generate.py --checkpoint_path $CHECKPOINT_PATH --compile --compile_prefill --quantization float8wo --write_result /root/benchmark_results__float8wo.txt
> Average tokens/sec: 57.00
> Average Bandwidth: 855.62 GB/s
> Peak Memory Usage: 16.19 GB
> Model Size: 15.01 GB
> 20241011143316, tok/s= 57.00, mem/s= 855.62 GB/s, peak_mem=16.19 GB, model_size=15.01 GB quant: float8wo, mod: Meta-Llama-3-8B, kv_quant: False, compile: True, compile_prefill: True, dtype: torch.bfloat16, device: cuda repro: python generate.py --quantization float8wo --checkpoint_path /models/Meta-Llama-3-8B/consolidated.00.pth --device cuda --precision torch.bfloat16 --compile --compile_prefill --num_samples 5 --max_new_tokens 200 --top_k 200 --temperature 0.8
The `float8wo` flag does not appear to be doing anything. Am I missing a step? Thanks!
The response from the thread was that sm_89 should be listed in the arch_list.
Thanks!
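For context, here is roughly what a row-wise-scaled FP8 matmul looks like through the private torch._scaled_mm op; this is only a sketch assuming a PyTorch 2.5-era signature, and on a build whose row-wise kernel was not compiled for the device it is expected to raise rather than silently fall back:
import torch
M, K, N = 128, 256, 64  # _scaled_mm expects dimensions that are multiples of 16
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")
fp8 = torch.float8_e4m3fn
fp8_max = torch.finfo(fp8).max
# Row-wise scaling: one scale per row of A (shape M x 1) and per column of B (1 x N).
scale_a = a.abs().amax(dim=1, keepdim=True) / fp8_max
scale_b = b.abs().amax(dim=0, keepdim=True) / fp8_max
a_fp8 = (a / scale_a).to(fp8)                       # A stays row-major
b_fp8 = (b / scale_b).to(fp8).t().contiguous().t()  # B must be column-major
# Dispatches the CUTLASS row-wise kernel when it was built for this device;
# otherwise it raises, which is the failure mode discussed in this thread.
out = torch._scaled_mm(a_fp8, b_fp8, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([128, 64])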
I don’t know if the current row-wise scaling kernel implementation is compatible with sm_89, and we are explicitly building them for sm_90a (arch-conditional) here.
vgoklani
(Vishal Goklani)
January 24, 2025, 8:21pm
There was a recent PR - Add SM89 support for f8f8bf16_rowwise() by alexsamardzic · Pull Request #144348 · pytorch/pytorch · GitHub - which “introduced support for _scaled_mm operator with FP8 inputs on SM89 architecture. The support is based on CUTLASS library, that is header-only C++ library, so this new functionality gets fully built along with PyTorch build; however, it will get built only in case the build includes SM89 among targets.”
Unfortunately this requires that sm_89 is in the list of build targets.
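Until a build that lists sm_89 is available, a hedged runtime guard (a hypothetical helper, approximating the conditions discussed above rather than an authoritative capability check) could look like:
import torch
def rowwise_fp8_available() -> bool:
    # Hypothetical helper: per the discussion above, the CUTLASS row-wise kernel
    # only exists when the build targets include the device's own arch
    # (sm_90a today, sm_89 once PR #144348 lands in a build that targets 8.9).
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9) and f"sm_{major}{minor}" in torch.cuda.get_arch_list()
print(rowwise_fp8_available())  # False on the 24.12 container from this thread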
I just opened a ticket: Support `sm_89` in Stable/Nightly/Docker Images · Issue #145632 · pytorch/pytorch · GitHub
Does this make sense?
No, we should not add sm_89 directly, as it will waste space with no benefit besides FP8 support. Instead, we should add sm_89 only to the one file supporting FP8, as explained in my comment on GitHub.