Hello, my question is: besides operator fusion and CUDA graphs, does it use any other techniques, such as memory management similar to vLLM's? Also, is there a difference between the CUDA graphs that torch.compile uses internally and manually capturing and replaying with torch.cuda.graph?
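For reference, this is what I mean by manual capture and replay. It's a minimal sketch of the torch.cuda.CUDAGraph workflow; the model and input shapes here are just placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then replay.
# static_output is overwritten in place with the new result.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
print(static_output.shape)
```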
I benchmarked my hand-written fused operators and found them two to three times faster than the fusion torch.compile produces. I also added CUDA graph capture and replay, yet end-to-end inference is only about 10% faster than torch.compile. Is there anything else I might have overlooked?
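In case it matters, this is roughly how I time each variant. A minimal sketch using CUDA events; `run_inference` is a hypothetical stand-in for whichever pipeline (hand-fused plus graph replay, or torch.compile) is being measured:

```python
import torch

def benchmark(run_inference, iters=100, warmup=10):
    # Warm up first so kernel compilation / autotuning caches don't skew timing.
    for _ in range(warmup):
        run_inference()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_inference()
    end.record()

    # Wait for all queued GPU work to finish before reading the timer.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration
```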