I have an application that does torch.matmul on large tensors. Typical dimensions for my use case are A (32, 3072) and B (3072, 4_000_000), where 32 is the batch size M, 3072 is the embedding dimension K, and 4_000_000 is N. All inputs are fp16 (half). Other shapes include M=32, K=512, N=16M, etc. A minimal sketch of the workload is below.
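For reference, this is roughly the matmul in isolation (shapes are the ones listed above; note B alone is ~24 GB at fp16, so you may need to scale N down to reproduce on smaller GPUs):

```python
import torch

# Sketch of the workload described above, using the first set of shapes.
M, K, N = 32, 3072, 4_000_000

A = torch.rand(M, K, device="cuda", dtype=torch.half)
B = torch.rand(K, N, device="cuda", dtype=torch.half)

out = torch.matmul(A, B)  # (32, 4_000_000), fp16
torch.cuda.synchronize()
```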
I am using torch 2.7. From a profile taken on an A100, I see that it uses cutlass sm75 kernels such as cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_nn_align1 for ~65% of the profile, and cutlass sm80 kernels such as cutlass_tensorop_f16_s16816gemm_f16_128x64_64x3_nt_align8 for the remaining ~35%. The profile is taken during a benchmark run of my app with 1000 requests; for a single request it picks either an sm75 kernel or an sm80 kernel. The request tensors are generated with torch.rand.
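This is roughly how I capture the per-kernel breakdown (standard torch.profiler usage; the real numbers come from the full 1000-request benchmark, this just profiles the matmul by itself):

```python
import torch
from torch.profiler import profile, ProfilerActivity

M, K, N = 32, 3072, 4_000_000
A = torch.rand(M, K, device="cuda", dtype=torch.half)
B = torch.rand(K, N, device="cuda", dtype=torch.half)

# Warm up so any heuristic/autotune choices are settled before profiling.
for _ in range(3):
    torch.matmul(A, B)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(A, B)
    torch.cuda.synchronize()

# The cutlass_tensorop_f16_* kernel names show up in this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```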
I am looking for insight into torch's kernel selection logic. Specifically, I see a significant performance difference (latency, TFLOPS) when I run the cutlass sm80 kernels directly. I am looking for ways to influence torch to always select sm80 kernels on A100 and sm90 kernels on H100.
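For context, these are the knobs I have found so far. I have not confirmed that any of them actually steers the selection toward the sm80 kernels, so please treat this as an assumption on my part rather than something that is known to work:

```python
import torch

# Assumption: these settings influence which GEMM backend/kernel torch picks;
# I have not verified that they force sm80 kernels on A100.

# Route matmuls through cuBLASLt instead of cuBLAS (or back, with "cublas").
torch.backends.cuda.preferred_blas_library("cublaslt")

# fp16 accumulation setting; may change which kernel families are eligible.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

# TunableOp: benchmark candidate GEMM kernels per shape and cache the winner
# (can also be enabled via PYTORCH_TUNABLEOP_ENABLED=1 / PYTORCH_TUNABLEOP_TUNING=1).
torch.cuda.tunable.enable(True)
torch.cuda.tunable.tuning_enable(True)
```

Is any of these the right lever, or is the choice entirely inside the cuBLAS/cuBLASLt heuristics and out of torch's hands?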