The CUDA kernel used by PyTorch matrix multiplication

MalaJeans · July 7, 2023, 5:20pm

When testing matrix multiplication with PyTorch, if the matrix sizes are m=10240, n=5120, k=5120, the CUDA kernel used by the matrix multiplication is:

But when the matrix sizes are m=40960, n=20480, k=10240, the result is:

Question:
When m=40960, n=20480, k=10240, is the CUDA matmul kernel not being used?

The code is:

import torch
import time

# Allow TF32 tensor-core kernels for float32 matmuls
torch.backends.cuda.matmul.allow_tf32 = True

m = 40960
n = 20480
k = 5120

# Random float32 operands created directly on the GPU
input = torch.randn(m, k, dtype=torch.float32, device='cuda')
weight = torch.randn(k, n, dtype=torch.float32, device='cuda')

output = torch.matmul(input, weight)

ptrblck · July 7, 2023, 8:39pm

The second output doesn’t show any matmul kernel execution.
Which device are you using and were you running into an error (e.g. OOM)?
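
One quick way to judge whether OOM is plausible is to add up the bytes the three tensors need. The sketch below is only a back-of-the-envelope estimate: the helper name `matmul_tensor_bytes` is made up for illustration, and it ignores cuBLAS workspace, caching-allocator overhead, and memory held by other processes.

```python
def matmul_tensor_bytes(m, n, k, bytes_per_element=4):
    """Bytes needed for input (m x k), weight (k x n) and output (m x n).

    Hypothetical helper for illustration; float32 (4 bytes) by default.
    """
    input_bytes = m * k * bytes_per_element
    weight_bytes = k * n * bytes_per_element
    output_bytes = m * n * bytes_per_element
    return input_bytes + weight_bytes + output_bytes


GIB = 1024 ** 3

# The sizes mentioned in the question, including both values of k
for m, n, k in [(10240, 5120, 5120), (40960, 20480, 5120), (40960, 20480, 10240)]:
    total = matmul_tensor_bytes(m, n, k)
    print(f"m={m}, n={n}, k={k}: {total / GIB:.2f} GiB")
```

Comparing the result with the free memory reported by `torch.cuda.mem_get_info()` gives a rough idea of whether the allocation can fit.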

MalaJeans · July 9, 2023, 1:51am

My device is RTX A5000.
What does the CUDA kernel “distribution elementwise grid stride kernel” do?
I looked at it again, and I think it was OOM. Thanks~

ptrblck · July 9, 2023, 5:55pm

This kernel should be called by the randn operation.
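
To see which operation launches which kernel, the randn calls and the matmul can be profiled together with `torch.profiler`. This is a minimal sketch and not necessarily the tool used for the outputs in the first post. The function name `profile_randn_matmul` and the small default sizes are assumptions chosen so that it runs quickly, and it falls back to the CPU when no GPU is available.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_randn_matmul(m=1024, n=512, k=512):
    """Profile randn + matmul and return the summary table as a string.

    Hypothetical helper for illustration; sizes are kept small on purpose.
    """
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        a = torch.randn(m, k, dtype=torch.float32, device=device)
        b = torch.randn(k, n, dtype=torch.float32, device=device)
        torch.matmul(a, b)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the kernels to finish

    # One row per operator (and, on a GPU, per kernel) seen while profiling
    return prof.key_averages().table()


print(profile_randn_matmul())
```

On a GPU, the random-number kernel should show up together with the `aten::randn` rows, and the GEMM kernel together with `aten::mm`. If the matmul fails (for example with an out-of-memory error), only the rows belonging to randn would be expected.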

MalaJeans · July 10, 2023, 1:28am

I get it!
Thank you for your reply!