Why is torch.compile so fast?

Hello, my question is: besides operator fusion and CUDA graphs, does torch.compile use any other techniques, such as memory management similar to vLLM's? Also, is there a difference between the CUDA graphs used by torch.compile and manually capturing and replaying with torch.cuda.graph?
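For context, here is a minimal sketch (the model, shapes, and warmup count are just illustrative) contrasting manual capture/replay via torch.cuda.graph with torch.compile's built-in use of CUDA graphs in "reduce-overhead" mode:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_in = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream before capture, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once; replay re-runs the recorded kernels with
# near-zero launch overhead, reading from / writing to the static buffers.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()  # static_out now holds the result for the new input

# torch.compile applies CUDA graphs automatically in this mode, on top of
# the fused Triton kernels its Inductor backend generates.
compiled = torch.compile(model, mode="reduce-overhead")
with torch.no_grad():
    out = compiled(static_in)
```

The main difference is that with torch.cuda.graph you manage the static input/output buffers and warmup yourself, while torch.compile handles that bookkeeping for you and records its own generated kernels rather than the eager ones.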

I tested my handwritten fused operators and found them to be two to three times faster than the fusion done by torch.compile. I also added CUDA graph capture and replay, yet my end-to-end inference is only about 10% faster than torch.compile. Is there anything else I might have overlooked?

Are you sure you are profiling the code correctly? I.e., did you compare the timelines of e.g. nsys profiles, or use synchronized host timers for the comparison?
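A common pitfall is timing asynchronous CUDA work without synchronizing. As a sketch, a harness along these lines (where `fn` stands in for your eager, compiled, or graph-replayed callable) gives comparable numbers:

```python
import torch

def benchmark(fn, *args, warmup=10, iters=100):
    # Warm up to exclude compilation and first-call overhead.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    # Without this sync, host timers return before the GPU has finished
    # and the measurement is meaningless.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```

If both variants are measured under the same harness and the gap persists, an nsys timeline of the two runs should show where the remaining time goes.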