Torch.matmul launches a different CUDA kernel from cuBLAS

Does torch.matmul always call the fastest CUDA kernel?

I tested torch.matmul against a cuBLAS matmul and found differences in both the kernels launched and the performance.

I thought they would call the same kernel, and thus always get the same performance, but it seems they call different CUDA kernels.

Why is that?

Does torch.matmul always have the same performance as cuBLAS?

Thank you!

The code I’m using:

void cublasTensorOp(half *A, half *B, half *C, size_t M, size_t N, size_t K) {
    // getCublasTensorOpHandle() and HGEMM_CHECK_CUBLAS_ERROR are project
    // helpers (handle creation with tensor-op math enabled, error checking).
    static cublasHandle_t handle = getCublasTensorOpHandle();
    static half alpha = 1.0;
    static half beta = 0.0;

    // cuBLAS is column-major, so the operands are swapped to produce a
    // row-major C. With CUBLAS_OP_T and ld = K, B is expected to be stored
    // as an (N, K) row-major matrix, giving C = A * B^T.
    HGEMM_CHECK_CUBLAS_ERROR(cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, N, M, K,
                                          &alpha, B, CUDA_R_16F, K, A,
                                          CUDA_R_16F, K, &beta, C, CUDA_R_16F, N,
                                          CUBLAS_COMPUTE_32F,
                                          CUBLAS_GEMM_DEFAULT_TENSOR_OP));
}
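Given those arguments, the call appears to compute the row-major product C = A · Bᵀ with B stored as (N, K); a rough PyTorch equivalent of that single call, using hypothetical tensors A and B, would be:

# Hypothetical shapes: A is (M, K), B is (N, K), both row-major FP16 on the GPU
C = A @ B.t()

And the PyTorch benchmark: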
import torch

# FP16 operands, created on the CPU and copied to the GPU
a = torch.ones((4096, 5120), dtype=torch.float16).cuda()
b = torch.ones((5120, 2*27648), dtype=torch.float16).cuda()
c = torch.ones((4096, 2*27648), dtype=torch.float16).cuda()

# Warm-up call so the timed loop below excludes one-time setup costs
torch.matmul(a, b, out=c)

cnt = 1000
torch.cuda.synchronize()

import time
start = time.time()
for i in range(cnt):
    torch.matmul(a, b, out=c)
torch.cuda.synchronize()
end = time.time()

# Average wall time per matmul, in milliseconds
print((end - start) * 1000 / cnt)
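As an aside, time.time() measures host wall time and depends on the surrounding synchronize() calls; CUDA events time the GPU work directly. A minimal sketch, reusing a, b, c, and cnt from above:

# CUDA events record timestamps on the GPU itself
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
for i in range(cnt):
    torch.matmul(a, b, out=c)
end_evt.record()
torch.cuda.synchronize()

# elapsed_time() returns milliseconds between the two events
print(start_evt.elapsed_time(end_evt) / cnt)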

Did you compare your cuBLAS implementation against the one used in PyTorch?

Do you mean the performance, or the C++ source code inside PyTorch’s matmul?

I didn’t trace into PyTorch to see the invocation.
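One way to see which kernels each side launches, without reading PyTorch’s C++ source, is torch.profiler on the Python side (and an external tool such as Nsight Systems for the standalone cuBLAS binary). A sketch; the reported kernel names will vary with the GPU and library versions:

import torch
from torch.profiler import profile, ProfilerActivity

a = torch.ones((4096, 5120), dtype=torch.float16).cuda()
b = torch.ones((5120, 2*27648), dtype=torch.float16).cuda()

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(a, b)

# Lists the CUDA kernels that were launched and their timings
print(prof.key_averages().table(sort_by="cuda_time_total"))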