Okay, cuDNN SDPA did not do any better. Still the CUTLASS error.
My point is (basically):
Using FramePack in NGC 25.10-py3 (I did pull the new Docker image) does kind of support the DGX, but inference runs at 10 s/it, whilst my (dated) Ada 4000 notebook card reaches (varying) 9-20 s/it.
I simply cannot believe the GB10 in the Spark is that slow ...
My only guess so far: SM121 is running via PTX or some other compatibility mode ...
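
To test that guess, here is a rough sketch of how I would check it from inside the container. It compares the GPU's compute capability with the architectures the PyTorch build was compiled for, using `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. The helper `classify_arch_support` and its verdict labels are my own (hypothetical) naming, and the matching rules are a simplification of CUDA's binary/PTX compatibility rules, so treat the result as a hint, not proof.

```python
import re

# Verdicts ordered from worst to best.
_RANK = {"unsupported": 0, "ptx-jit": 1, "binary-compatible": 2, "native": 3}


def classify_arch_support(capability, arch_list):
    """Guess how a build with `arch_list` would run on a GPU with `capability`.

    capability: (major, minor) tuple, e.g. (12, 1) for SM121.
    arch_list:  strings such as "sm_90", "sm_120", "compute_90".
    Simplification (my assumption): suffixed targets like "sm_90a" are only
    counted when they match the device exactly.
    """
    major, minor = capability
    device = major * 10 + minor
    best = "unsupported"
    for entry in arch_list:
        m = re.fullmatch(r"(sm|compute)_(\d+)([a-z]?)", entry)
        if not m:
            continue
        kind, num, suffix = m.group(1), int(m.group(2)), m.group(3)
        if kind == "sm":
            if num == device:
                verdict = "native"  # machine code built for exactly this GPU
            elif not suffix and num // 10 == major and num % 10 < minor:
                verdict = "binary-compatible"  # same major, older minor
            else:
                continue
        else:
            # Embedded PTX can be JIT-compiled for an equal or newer GPU.
            if num == device or (not suffix and num < device):
                verdict = "ptx-jit"
            else:
                continue
        if _RANK[verdict] > _RANK[best]:
            best = verdict
    return best


if __name__ == "__main__":
    import torch  # only needed for the live check

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("build arch list:  ", archs)
        print("verdict:          ", classify_arch_support(cap, archs))
    else:
        print("CUDA not available in this environment")
```

If the verdict is anything other than "native", the kernels were not built specifically for SM121, which would at least be consistent with the slowdown. Note this only covers PyTorch's own kernels; separately compiled extensions (attention kernels, CUTLASS-based ops) carry their own arch lists.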