Hi, I’m testing torch 2.7.1 matmul performance with CUDA 12.6 on an H100 GPU. Surprisingly, matmul now calls an nvjet kernel, whereas torch 2.6.0 called an sm90_xmma kernel for the same workload.
Does anyone know why this behaviour is happening?
Thanks a lot for the kind help!
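For anyone wanting to reproduce this, one way to see which CUDA kernels a matmul launches is `torch.profiler`. This is a minimal sketch (the helper name `matmul_kernel_names` is mine, and the exact kernel names you see will depend on your GPU, torch, and CUDA versions):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from torch.autograd import DeviceType

def matmul_kernel_names(n=1024):
    """Return the CUDA kernel names launched by one matmul ([] on CPU-only)."""
    if not torch.cuda.is_available():
        return []
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.matmul(a, b)          # warm-up so the cuBLAS heuristic choice settles
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        torch.matmul(a, b)
        torch.cuda.synchronize()
    # Keep only device-side events, i.e. the actual kernel names
    return sorted({e.key for e in prof.key_averages()
                   if e.device_type == DeviceType.CUDA})

if __name__ == "__main__":
    print(matmul_kernel_names())
```

On an H100 with torch 2.7.1 you should see a kernel name containing `nvjet` in this list.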
This is expected as cuBLAS selects the fastest kernel for the given workload via its heuristics.
Thanks for the clarification! Very helpful.
Is there any way to force it to not use nvjet kernels and instead use only the CUTLASS and cuBLAS kernels?
Not directly, no. If you want to select a specific algorithm, you could try to implement your own backend/heuristic via `cublasLtMatmulAlgoGetHeuristic`. However, note that it’s not guaranteed that algorithms other than the nvjet engine exist for every workload.
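To make that concrete, here is a rough host-side C++ sketch (untested here, error checking omitted) of querying the cuBLASLt heuristic for candidate algorithms for an FP16 GEMM; each returned `algo` could then be passed explicitly to `cublasLtMatmul`:

```cpp
#include <cublasLt.h>
#include <cstdio>

int main() {
    const int m = 1024, n = 1024, k = 1024;

    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // Describe the operation: FP16 compute, FP16 scale type.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    // Column-major layouts for A (m x k), B (k x n), C (m x n).
    cublasLtMatrixLayout_t a, b, c;
    cublasLtMatrixLayoutCreate(&a, CUDA_R_16F, m, k, m);
    cublasLtMatrixLayoutCreate(&b, CUDA_R_16F, k, n, k);
    cublasLtMatrixLayoutCreate(&c, CUDA_R_16F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);

    // Ask the heuristic for up to 8 candidate algorithms.
    const int kRequested = 8;
    cublasLtMatmulHeuristicResult_t results[kRequested];
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(handle, op, a, b, c, c,
                                   pref, kRequested, results, &found);
    std::printf("heuristic returned %d candidate algo(s)\n", found);
    // results[i].algo is an opaque handle; note there is no documented field
    // identifying the backing engine (nvjet vs sm90_xmma), so you can rank
    // candidates but not filter them by engine name.

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(c);
    cublasLtMatrixLayoutDestroy(b);
    cublasLtMatrixLayoutDestroy(a);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
    return 0;
}
```

In practice you would benchmark the returned candidates yourself and call `cublasLtMatmul` with the one you pick, which is essentially what cuBLAS's own heuristic does internally.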