DGX Spark GB10, CUDA 13.0, Python 3.12, sm_121

Hi all

I have compiled the PyTorch 2.9.0 nightly with sm_121 and CUDA 13.0 support on my DGX, and it went almost smoothly (I also had to recompile torchvision, Triton, etc.).
What I am trying to do is run FramePack inference on the GB10 (DGX Spark FE) to get a feel for inference performance (I know there are other ways to benchmark).
What I stumbled across: the PyTorch sources vendor a flash_attn implementation that references sm80 .cu files, which apparently get built into a .so (lazily at build time, I guess). That then causes a hard error on the Spark:

FATAL: kernel fmha_cutlassF_f16_aligned_64x128_rf_sm80 is for sm80-sm100, but was built for sm121

Is this a known issue? Or, even better, do you know of a solution?

Thanks in advance

Andreas

P.S.: Almost forgot: I am using Python 3.12 in a Docker container based on NGC nvcr.io/nvidia/pytorch:25.09-py3.


Try to use cuDNN SDPA as the backend in the meantime.

Okay, well, I did not develop the FramePack environment (it uses Hunyuan video), so I am not 100% sure I can switch the backend entirely, but I will give it a try.
Apart from that, you said the DGX is fully supported, so does that mean stable packages for sm_121 will be published (soon)?

Okay, cuDNN SDPA did not do better; still the CUTLASS error.
My point is, basically:
FramePack in NGC 25.10-py3 (I did pull the new container) does kind of support the DGX, but inference runs at 10 s/it, while my (dated) ADA 4000 notebook card reaches (varying) 9-20 s/it.
I simply cannot believe the GB10 in the Spark is that slow.
My only guess was: sm_121 running in PTX or some other compatibility mode.
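For what it's worth, one way to sanity-check whether the GPU is hitting native TensorCore throughput rather than some slow fallback is a bare fp16 matmul timing. A rough sketch (function name, sizes, and iteration counts are mine):

```python
import time
import torch

def matmul_tflops(n=4096, iters=20, device="cuda", dtype=torch.float16):
    # Rough throughput probe: a GB10 running native sm_121 SASS should
    # land far above what a JIT/compatibility path would deliver.
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    for _ in range(3):                 # warm-up (also triggers any JIT)
        a @ b
    if device != "cpu":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device != "cpu":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    # One n x n matmul is ~2*n^3 FLOPs.
    return 2 * n**3 * iters / elapsed / 1e12

if torch.cuda.is_available():
    print(f"{matmul_tflops():.1f} TFLOP/s")
```

If that number is in the expected range for the card, the slowdown is more likely in the model/pipeline than in a compatibility mode.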

When will PyTorch support sm_121 architecture?

Every PyTorch build with CUDA 12.8 and sm_120 support already supports sm_121 (DGXSpark) as these architectures are binary compatible.

I also have this issue when building PyTorch from source with compute capability 12.1. It looks like PyTorch is vendoring a new enough CUTLASS on main to support sm_121, so is the fix as simple as raising these ifdefs to #if CUDA_ARCH <= 1210? pytorch/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassF_bf16_aligned.cu at main · pytorch/pytorch · GitHub

Is there a Github issue for this? I couldn’t find one, but I’d be happy to try to contribute. Thanks!

I have just tested, and it appears that sm_121 is still not supported: Effective PyTorch and CUDA - #25 by fg121 - DGX Spark / GB10 - NVIDIA Developer Forums

As explained before: all of our binaries for ARM built with CUDA >= 12.8 support DGXSpark already.

@ptrblck then what am i missing?

FROM nvcr.io/nvidia/pytorch:25.12-py3

>>> import torch
>>> torch.cuda.get_device_capability()
(12, 1)
>>> torch.cuda.get_arch_list()
['sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_110', 'sm_120', 'compute_120']

Nothing, since sm_121 is binary compatible with sm_120 which is already supported in the binaries as you are showing. You can just use them.
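The dispatch rule being described can be sketched as a small helper (hypothetical, not a PyTorch API; the name and parsing are mine): a device can run SASS built for the same major architecture at an equal or lower minor revision, and can JIT any bundled PTX of equal or lower capability.

```python
def runs_on(capability, arch_list):
    """Whether a device with `capability` (major, minor) is covered by
    at least one entry of a torch.cuda.get_arch_list()-style list."""
    major, minor = capability
    for arch in arch_list:
        kind, _, cc = arch.partition("_")        # e.g. ("sm", "_", "120")
        a_major, a_minor = int(cc[:-1]), int(cc[-1])
        # SASS (sm_*): binary compatible within one major architecture
        # when the device minor revision >= the build minor revision.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX (compute_*): forward compatible via JIT compilation.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False

# GB10 (capability 12.1) is covered by the sm_120 binaries shown above:
print(runs_on((12, 1), ["sm_80", "sm_86", "sm_90", "sm_100",
                        "sm_110", "sm_120", "compute_120"]))  # True
```

By the same rule, an sm_89 (Ada) card is covered by sm_86/sm_80 builds, which is the precedent mentioned later in this thread.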

Thank you, you made me realise there was nothing wrong with the env; I simply had to remove some debugging code.


I’d still be interested to know what performance benefit, if any, there is to compiling with sm_121 support (or even sm_121a when doing a custom build).

You won’t see any performance benefits when building for sm_121 instead of sm_120 and will in the best case just waste binary space. sm_121a is still needed and used for TensorCore usage as seen here. However, since this arch-conditional compilation is only used for this single file it won’t show up in torch.cuda.get_arch_list().

Thanks. It still seems odd to me that sm_121 would have no benefit over sm_120, though; presumably that’s not the case in general when targeting different compute capabilities, so is there something special about 12.1?

No, there is nothing special about sm_121, and the compatibility is not odd either: the same relationship holds for e.g. sm_89 with respect to sm_86 and sm_80. I.e., you will also see no native sm_89 kernels, as these are binary compatible with the major architecture. TensorCore kernels are an exception, as mentioned before, and need special handling.

Thanks for the reply. To be clear, I was asking about performance, not compatibility; I understand it’s not required to run (apart from the case you point out). Thanks!