Hi! I am writing a CUDA matmul kernel for research reasons (I have some ideas, so I really need to write the kernel myself). I compared this kernel with cuBLAS, and my kernel is about 2 times faster than cuBLAS.
Later I packaged my kernel into a Python package (following PyTorch's official method: Custom C++ and CUDA Extensions — PyTorch Tutorials 1.11.0+cu102 documentation) and compared it with torch.matmul. Now torch.matmul is about 400 times faster than my kernel! Why does this change?
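For context, the tutorial-style build I followed looks roughly like this. This is only a minimal sketch of the CUDAExtension pattern from that tutorial; the file names (my_matmul.cpp, my_matmul_kernel.cu) and package name are placeholders, not my actual files:

```python
# setup.py: minimal sketch of the build step from the PyTorch
# custom C++/CUDA extension tutorial. File and package names are
# placeholders, not the actual files in my repo.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_matmul",
    ext_modules=[
        CUDAExtension(
            name="my_matmul",
            # the C++ file holds the pybind11 bindings,
            # the .cu file holds the kernel and its launcher
            sources=["my_matmul.cpp", "my_matmul_kernel.cu"],
        )
    ],
    # BuildExtension routes the .cu sources through nvcc
    cmdclass={"build_ext": BuildExtension},
)
```

Building with `python setup.py install` then lets Python code call the kernel via `import my_matmul`.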
(I tested this on an A100 and a GTX 1650 with similar results: my kernel is much slower!)
My guess: is my way of packaging slower than PyTorch's method?
I have uploaded all the files to my GitHub, with detailed steps, so you can easily compile the .cu files and package them into a Python library.
Some parameters and experiment results:
- matrix sizes: (64 × 5000) · (5000 × 5000) => 64 × 5000
- .cu result: my kernel 0.540128 ms, cuBLAS 1.16879 ms (my kernel about 2× faster)
- Python result: my kernel 197.69 s; torch.matmul 59 s on the first iteration, then stable at about 0.5 s (about 400× faster than my kernel)
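Since the first iteration is much slower than the steady state (59 s, then ~0.5 s), timing methodology may matter here. A common pattern is to discard warm-up iterations before measuring, sketched below with plain Python timing. Note this is an illustrative helper I wrote, not code from my repo; for CUDA code you would additionally call `torch.cuda.synchronize()` before each clock read, because kernel launches are asynchronous and the host clock alone may only measure launch overhead:

```python
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Average runtime of fn(*args), discarding warm-up runs.

    The first calls may carry one-time costs (extension loading,
    caching, CUDA context setup), so they are run but not timed.
    For GPU kernels, also synchronize the device before reading
    the clock (e.g. torch.cuda.synchronize() in PyTorch).
    """
    for _ in range(warmup):
        fn(*args)                      # untimed warm-up calls
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)                      # timed steady-state calls
    return (time.perf_counter() - start) / iters
```

Usage: `benchmark(my_matmul.forward, a, b)` would report the steady-state time per call rather than the cold first-call time.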
This is my code, so you can see all the details: