How does `torch.jit` optimize element-wise operations?

I find that even when I write a CUDA kernel by hand, it is still slower than the one produced by torch.jit. I am therefore interested in how torch.jit optimizes element-wise operations. Could the developers explain the reasons behind this?

In the current release, scripted models use nvFuser to code-generate fused CUDA kernels for elementwise operations.
This blog post, this tutorial, and this GTC talk give more detail on the underlying technology.
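
In case it helps to see the effect concretely, here is a minimal sketch (not from the thread): scripting a chain of pointwise ops and inspecting the optimized graph. The `gelu_like` function and its constants are purely illustrative; on a CUDA device with nvFuser active, the whole chain should appear as a single fusion node (e.g. `prim::CudaFusionGroup`) rather than as separate kernels.

```python
import torch

# A chain of pointwise ops -- a typical fusion candidate.
# This function is an illustrative example, not from the thread.
def gelu_like(x: torch.Tensor) -> torch.Tensor:
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))

scripted = torch.jit.script(gelu_like)

# Fusion into a CUDA kernel only happens on a CUDA device.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

# The profiling executor specializes and fuses after a few warm-up runs.
for _ in range(3):
    scripted(x)

# On CUDA, the elementwise chain should collapse into a single fusion
# node (prim::CudaFusionGroup when nvFuser is the active fuser).
print(torch.jit.last_executed_optimized_graph())
```

Running the hand-written kernel side by side against `scripted(x)` is a reasonable way to compare, since the fused kernel avoids the extra global-memory round trips that separate elementwise kernels incur.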
