Self-written SGEMM similar to cuBLAS in .cu, but much slower than torch.matmul when compiled into a Python library!?

Hi! I am writing a CUDA kernel for matmul for research reasons (I have some ideas, so I really need to write the kernel myself). I compared this kernel with cuBLAS, and cuBLAS is about 2 times faster than my kernel.
Later I packaged my kernel into a Python package (following PyTorch's official method: Custom C++ and CUDA Extensions — PyTorch Tutorials 1.11.0+cu102 documentation) and compared it with torch.matmul. Now torch.matmul is about 400 times faster than my kernel! Why does this change?
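The packaging follows the JIT route from that tutorial. Below is a simplified sketch of what it looks like; my_sgemm.cu and the exposed sgemm function are placeholder names for illustration, and the .cu file is assumed to contain the usual PYBIND11_MODULE binding from the tutorial.

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile the CUDA source into an importable Python module
# (placeholder file/function names; the real kernel is in the repo).
my_sgemm = load(name="my_sgemm", sources=["my_sgemm.cu"], verbose=True)

a = torch.randn(64, 5000, device="cuda")
b = torch.randn(5000, 5000, device="cuda")

c = my_sgemm.sgemm(a, b)     # custom kernel
ref = torch.matmul(a, b)     # cuBLAS-backed reference
print(torch.allclose(c, ref, atol=1e-3))
```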

(I tested this on an A100 and a GTX 1650 with similar results: much slower in both cases!)

My guess: is my way of packaging the kernel slower than PyTorch's method?

I have uploaded all the files to my GitHub with detailed steps, so it should be straightforward to compile the .cu files and package them into a Python library.

Some parameters and experiment results:
matrix size: (64 x 5000) @ (5000 x 5000) => (64 x 5000)
.cu result: my kernel: 0.540128 ms, cuBLAS: 1.16879 ms (about a 2x difference)
Python result: my kernel: 197.69 s; PyTorch: 59 s for the first iteration, then stable at about 0.5 s, so about 400 times faster!
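These Python-side numbers were measured without CUDA synchronization; a rough sketch of that kind of unsynchronized timing (names are illustrative, not my actual script):

```python
import time
import torch

a = torch.randn(64, 5000, device="cuda")
b = torch.randn(5000, 5000, device="cuda")

# Unsynchronized timing: because CUDA calls are asynchronous, this can
# measure little more than the kernel launch, plus one-time setup costs
# (extension compilation, CUDA context / cuBLAS initialization) on the
# first iteration.
start = time.perf_counter()
out = torch.matmul(a, b)
print(f"{(time.perf_counter() - start) * 1e3:.3f} ms (misleading)")
```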

This is my code, so you can see all the details.

Note that CUDA operations are executed asynchronously, so you have to synchronize the code before starting and stopping the timers. Alternatively, use torch.utils.benchmark to profile the workloads, which adds warmup iterations as well as synchronizations.
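A minimal sketch of both approaches, with shapes taken from the numbers in your post:

```python
import time
import torch
import torch.utils.benchmark as benchmark

a = torch.randn(64, 5000, device="cuda")
b = torch.randn(5000, 5000, device="cuda")

# Manual timing: warm up, then synchronize right before starting and
# right before stopping the timer so the full kernel execution is measured.
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()
start = time.perf_counter()
torch.matmul(a, b)
torch.cuda.synchronize()
print(f"torch.matmul: {(time.perf_counter() - start) * 1e3:.3f} ms")

# torch.utils.benchmark handles warmup and synchronization for you.
t = benchmark.Timer(stmt="torch.matmul(a, b)", globals={"a": a, "b": b})
print(t.blocked_autorange())
```

The same pattern applies to the custom kernel: call it a few times first, synchronize, time it, and synchronize again before reading the timer.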


Your suggestion is correct. Even PyTorch needs torch.cuda.synchronize; previously I did not realize that. Thank you!