I find that even when I write a CUDA kernel by hand, it is still slower than the one produced by torch.jit. I am therefore curious how torch.jit optimizes element-wise operations. Could the developers explain the reasons behind this?
In the current release, scripted models use nvFuser to code-generate fused CUDA kernels for element-wise operations.
This blog post, this tutorial, and this GTC talk provide more information about the technology.
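As an illustration (not from the original thread), here is a minimal sketch of the kind of pointwise chain the JIT can fuse. The function name and constants below are made up for the example; the point is that once scripted, the whole chain of element-wise ops becomes a single subgraph that the fuser can lower into one kernel instead of launching one kernel per op.

```python
import torch

# A chain of element-wise ops (tanh-based GELU approximation).
# When scripted and run on CUDA, the JIT can hand this pointwise
# subgraph to its fuser and emit a single fused kernel.
@torch.jit.script
def gelu_ish(x: torch.Tensor) -> torch.Tensor:
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x * x * x)))

x = torch.randn(1024)
y = gelu_ish(x)
print(y.shape)
```

A hand-written kernel for the same math can easily lose to this if it misses vectorized loads, occupancy tuning, or fast intrinsics that the code generator applies automatically.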