How `torch.jit` optimizes the element-wise operation

I find that even when I write a CUDA kernel by hand, my kernel is still slower than the one produced by torch.jit, so I am curious how torch.jit optimizes element-wise operations. Could the developers explain the reasons behind this?

In the current release, scripted models use nvFuser to code-generate fused CUDA kernels for element-wise operations, which is why a chain of element-wise ops can run as a single kernel rather than one kernel launch per op.
This blog post, this tutorial, and this GTC talk give you more information about the technology.
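As a minimal sketch of what gets fused (the function below is a hypothetical example, not code from the question): scripting a chain of element-wise ops with `torch.jit.script` lets the fuser compile them into a single kernel after a few warm-up runs on a CUDA device, instead of launching one kernel per op.

```python
import torch

def gelu_like(x):
    # A chain of purely element-wise ops: mul, add, tanh, etc.
    # When scripted and run on CUDA, the fuser can emit one kernel for all of them.
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x * x * x)))

scripted = torch.jit.script(gelu_like)

# Use device="cuda" to enable kernel fusion; CPU still runs, just unfused.
x = torch.randn(1024)

# The profiling executor specializes and fuses after a few warm-up calls.
for _ in range(3):
    y = scripted(x)

# The scripted version is numerically equivalent to eager mode.
print(torch.allclose(y, gelu_like(x)))
```

Manually matching this is hard because the generated kernel reads each input element once and writes each output element once, with no intermediate tensors materialized in global memory.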