Hi, I’m testing torch 2.7.1 matmul performance with CUDA 12.6 on an H100 GPU. Surprisingly, matmul now calls an nvjet kernel, whereas torch 2.6.0 called an sm90_xmma kernel for the same workload.
Does anyone know why this behaviour is happening?
Thanks a lot for the kind help!
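For anyone wanting to reproduce this, one way to see which CUDA kernels a matmul launches is `torch.profiler`. This is a minimal sketch (the helper name `matmul_kernel_names` is mine, and the exact kernel names you see will depend on your GPU, torch, and CUDA versions):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from torch.autograd import DeviceType

def matmul_kernel_names(n=1024):
    """Return the CUDA kernel names launched by one matmul ([] on CPU-only)."""
    if not torch.cuda.is_available():
        return []
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.matmul(a, b)          # warm-up so the cuBLAS heuristic choice settles
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        torch.matmul(a, b)
        torch.cuda.synchronize()
    # Keep only device-side events, i.e. the actual kernel names
    return sorted({e.key for e in prof.key_averages()
                   if e.device_type == DeviceType.CUDA})

if __name__ == "__main__":
    print(matmul_kernel_names())
```

On an H100 with torch 2.7.1 you should see a kernel name containing `nvjet` in this list.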
This is expected as cuBLAS selects the fastest kernel for the given workload via its heuristics.
Thanks for the clarification! Very helpful.
Is there any way to force it to not use nvjet kernels and instead use only the CUTLASS and cuBLAS kernels?
Not directly, no. If you want to select a specific algorithm, you could try to implement your own backend/heuristic via `cublasLtMatmulAlgoGetHeuristic`. However, note that it’s not guaranteed that algorithms other than the nvjet engine exist for every workload.
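To make that concrete, here is a rough host-side C++ sketch (untested here, error checking omitted) of querying the cuBLASLt heuristic for candidate algorithms for an FP16 GEMM; each returned `algo` could then be passed explicitly to `cublasLtMatmul`:

```cpp
#include <cublasLt.h>
#include <cstdio>

int main() {
    const int m = 1024, n = 1024, k = 1024;

    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // Describe the operation: FP16 compute, FP16 scale type.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    // Column-major layouts for A (m x k), B (k x n), C (m x n).
    cublasLtMatrixLayout_t a, b, c;
    cublasLtMatrixLayoutCreate(&a, CUDA_R_16F, m, k, m);
    cublasLtMatrixLayoutCreate(&b, CUDA_R_16F, k, n, k);
    cublasLtMatrixLayoutCreate(&c, CUDA_R_16F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);

    // Ask the heuristic for up to 8 candidate algorithms.
    const int kRequested = 8;
    cublasLtMatmulHeuristicResult_t results[kRequested];
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(handle, op, a, b, c, c,
                                   pref, kRequested, results, &found);
    std::printf("heuristic returned %d candidate algo(s)\n", found);
    // results[i].algo is an opaque handle; note there is no documented field
    // identifying the backing engine (nvjet vs sm90_xmma), so you can rank
    // candidates but not filter them by engine name.

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(c);
    cublasLtMatrixLayoutDestroy(b);
    cublasLtMatrixLayoutDestroy(a);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
    return 0;
}
```

In practice you would benchmark the returned candidates yourself and call `cublasLtMatmul` with the one you pick, which is essentially what cuBLAS's own heuristic does internally.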