I have compiled a PyTorch 2.9.0 nightly with sm_121 and CUDA 13.0 support on my DGX, and it went almost smoothly (I also had to recompile torchvision, Triton, etc.).
What I am trying to do: run FramePack inference on the GB10 (DGX Spark FE) to find out about inference performance (I know there are other ways to test).
What I stumbled across: the PyTorch sources include flash_attn, which references sm80.cu files that apparently end up in a .so (a lazy build dependency, I guess). That then causes a hard error on the Spark:
FATAL: kernel fmha_cutlassF_f16_aligned_64x128_rf_sm80 is for sm80-sm100, but was built for sm121
Is this a known issue? Or, even better, do you know of a solution?
Okay, well, I did not develop the FramePack environment (it uses Hunyuan video), so I am not 100% sure I can change the backend entirely, but I'll give it a try.
Apart from that, you said the DGX is fully supported, so does that mean stable packages for sm_121 will (soon) be published?
Okay, cuDNN SDPA did not do better. Still the CUTLASS error.
My point is, basically:
running FramePack in NGC 25.10-py3 (I did pull the new Docker image) does kind of support the DGX, but inference is at ~10 s/it, while my (dated) Ada 4000 notebook card reaches a (varying) 9-20 s/it;
I simply cannot believe the GB10 in the Spark is that slow …
My only guess: sm_121 running in PTX or some other compatibility mode …
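To check whether my build actually ships native SASS for the GB10 or would fall back to PTX JIT, I think something like this should show it (my reading of what `torch.cuda.get_arch_list()` reports):

```python
import torch

# Architectures this build ships code for,
# e.g. ['sm_80', ..., 'sm_120', 'compute_120']
archs = torch.cuda.get_arch_list()
print(archs)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    native = f"sm_{major}{minor}" in archs
    # If only a 'compute_xx' (PTX) entry covers the device,
    # kernels get JIT-compiled from PTX at load time.
    print(f"device sm_{major}{minor}, native SASS in build: {native}")
```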
You won’t see any performance benefit when building for sm_121 instead of sm_120; in the best case you just waste binary space. sm_121a is still needed and is used for TensorCore usage, as seen here. However, since this arch-conditional compilation is only used for this single file, it won’t show up in torch.cuda.get_arch_list().
Thanks, it still seems odd to me that sm_121 would have no benefit over sm_120, though; presumably that’s not the case in general when targeting different compute capabilities, so is there something special about 12.1?
No, there is nothing special about sm_121, and the compatibility is also not odd, as the same applies to e.g. sm_89 with sm_86 and sm_80. I.e., you will also see a lack of sm_89 native kernels, as these are binary-compatible with the major architecture. TensorCore kernels are an exception, as mentioned before, and need special handling.
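The minor-version rule can be sketched in plain Python. This is my reading of CUDA binary (SASS) compatibility: a cubin built for compute capability X.y runs on devices X.z with z ≥ y, but never across major versions; the arch-specific "a" variants like sm_121a are an exception and not forward-compatible at all:

```python
def parse_cc(arch: str) -> tuple[int, int]:
    # "sm_121" -> (12, 1); assumes a single-digit minor, as in all current archs
    cc = arch.split("_")[1]
    return int(cc[:-1]), int(cc[-1])

def sass_runs_on(built_for: str, device: str) -> bool:
    """SASS binary compatibility: same major compute capability,
    device minor version >= build minor version."""
    bmaj, bmin = parse_cc(built_for)
    dmaj, dmin = parse_cc(device)
    return bmaj == dmaj and dmin >= bmin

# Examples matching the thread:
assert sass_runs_on("sm_120", "sm_121")      # sm_120 kernels run natively on GB10
assert sass_runs_on("sm_80", "sm_86")
assert sass_runs_on("sm_86", "sm_89")
assert not sass_runs_on("sm_80", "sm_121")   # different major: only PTX JIT helps
```

Which is why the fmha kernel built for sm80–sm100 hard-errors on sm121 rather than running in some compatibility mode.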
Thanks for the reply. To be clear, I was asking about performance, not compatibility; I understand it’s not required to run (apart from the case you point out). Thanks!