Why is torch.compile so fast?

Hello, my question is: besides operator fusion and CUDA graphs, does torch.compile use any other techniques, such as memory management similar to vLLM's? Also, is there a difference between the CUDA graphs used by torch.compile and manually capturing and replaying with torch.cuda.graph?
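For context, here is a minimal sketch (the model, shapes, and warmup count are just illustrative) contrasting manual capture/replay via torch.cuda.graph with torch.compile's built-in use of CUDA graphs in "reduce-overhead" mode:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_in = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream before capture, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once; replay re-runs the recorded kernels with
# near-zero launch overhead, reading from / writing to the static buffers.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()  # static_out now holds the result for the new input

# torch.compile applies CUDA graphs automatically in this mode, on top of
# the fused Triton kernels its Inductor backend generates.
compiled = torch.compile(model, mode="reduce-overhead")
with torch.no_grad():
    out = compiled(static_in)
```

The main difference is that with torch.cuda.graph you manage the static input/output buffers and warmup yourself, while torch.compile handles that bookkeeping for you and records its own generated kernels rather than the eager ones.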

I tested my handwritten fused operators and found them to be two to three times faster than the fusion done by torch.compile. I also added CUDA graph capture and replay, yet my end-to-end inference is only about 10% faster than torch.compile. Is there anything else I might have overlooked?

Are you sure you are profiling the code correctly? I.e., did you compare the timelines of e.g. nsys profiles, or use synchronized host timers for the comparison?
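A common pitfall is timing asynchronous CUDA work without synchronizing. As a sketch, a harness along these lines (where `fn` stands in for your eager, compiled, or graph-replayed callable) gives comparable numbers:

```python
import torch

def benchmark(fn, *args, warmup=10, iters=100):
    # Warm up to exclude compilation and first-call overhead.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    # Without this sync, host timers return before the GPU has finished
    # and the measurement is meaningless.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```

If both variants are measured under the same harness and the gap persists, an nsys timeline of the two runs should show where the remaining time goes.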