DGX Spark GB10, CUDA 13.0, Python 3.12, SM_121

Hi all

I have compiled the PyTorch 2.9.0 nightly with sm_121 and CUDA 13.0 support on my DGX, and it went almost smoothly (I also had to recompile torchvision, Triton, etc.).
What I am trying to do: run FramePack inference on the GB10 (DGX Spark FE) to get a feel for inference performance (I know there are other ways to test).
What I stumbled across: the PyTorch sources bundle flash_attn, which references sm80 .cu files, and these appear to get built into a .so (I guess a lazy build dependency). That then causes a hard error on the Spark:

FATAL: kernel fmha_cutlassF_f16_aligned_64x128_rf_sm80 is for sm80-sm100, but was built for sm121

Is this a known issue? Or, even better, do you know of a solution?

Thanks in advance

Andreas

P.S.: I almost forgot: this is Python 3.12 in a Docker container based on NGC nvcr.io/nvidia/pytorch:25.09-py3.
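For anyone reproducing the source build: the relevant knob is the arch list. A minimal sketch (TORCH_CUDA_ARCH_LIST and USE_CUDA are PyTorch's actual build variables; the "12.1" value is my assumption for what maps to sm_121 on the GB10):

```shell
# Sketch: environment for a PyTorch source build targeting GB10 (sm_121).
# TORCH_CUDA_ARCH_LIST / USE_CUDA are PyTorch's build variables; "12.1"
# is assumed here to correspond to sm_121.
export TORCH_CUDA_ARCH_LIST="12.1"
export USE_CUDA=1
# Then, inside the pytorch checkout (not executed here):
#   pip install -r requirements.txt
#   python setup.py develop
```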

Try using cuDNN SDPA as the backend in the meantime.

Okay, well, I did not develop the FramePack environment (it uses Hunyuan Video), so I am not 100% sure I can change the backend entirely, but I will give it a try.
Apart from that, you said the DGX is fully supported; does that mean stable packages for sm_121 are going to be published (soon)?

Okay, cuDNN SDPA did not do better; still the CUTLASS error.
My point is, basically:
running FramePack in NGC 25.10-py3 (I did pull the new container) does kind of support the DGX, but inference runs at ~10 s/it, whereas my (dated) ADA 4000 notebook card reaches a (varying) 9-20 s/it.
I simply cannot believe the GB10 in the Spark is that slow …
My only guess: sm_121 is running in PTX or some other compatibility mode …
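A quick way to test that PTX guess with the stock torch.cuda introspection APIs (the calls are real; the (12, 1) expectation for the GB10 is my assumption):

```python
import torch

def wheel_has_native_kernels(device: int = 0) -> bool:
    """True if this PyTorch build ships binary (SASS) kernels for the
    given device's architecture; False suggests kernels are being
    JIT-compiled from PTX (slow first runs, and hand-written kernels
    with hard arch checks may refuse to load at all)."""
    major, minor = torch.cuda.get_device_capability(device)
    # get_arch_list() shows what the wheel was built for,
    # e.g. ['sm_80', 'sm_90', 'compute_90'].
    return f"sm_{major}{minor}" in torch.cuda.get_arch_list()

if torch.cuda.is_available():
    print(torch.cuda.get_device_capability(0))  # expect (12, 1) on GB10
    print(torch.cuda.get_arch_list())
    print("native kernels:", wheel_has_native_kernels())
```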