Hi, I’m testing matmul performance with torch 2.7.1 and CUDA 12.6 on an H100 GPU. Surprisingly, the matmul now dispatches to an nvjet kernel, whereas with torch 2.6.0 it used an sm90_xmma kernel.
Does anyone know why this is happening?
Thanks a lot for the kind help!
This is expected: cuBLAS selects the fastest kernel for a given workload via its internal heuristics, and different PyTorch releases bundle different cuBLAS versions, so the selected kernel can change between releases even for the same matmul shape.
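If you want to confirm which kernel cuBLAS picked for your shapes, a minimal sketch using the PyTorch profiler (the matrix sizes here are arbitrary; swap in your actual workload):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Use the GPU if present; the profiler also works on CPU,
# but cuBLAS kernel names only appear for CUDA runs.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    torch.matmul(a, b)

# On an H100 the CUDA kernel rows show the cuBLAS kernel the
# heuristics selected (e.g. an nvjet_* or sm90_xmma_* name).
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

Comparing this output between torch 2.6.0 and 2.7.1 environments should show the kernel change directly.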
Thanks for the clarification! Very helpful.