Hi all, is there any related work on “megakernels” here?
I’m planning to research this topic using torch.compile.
According to the article: “We bypass a key issue by merging the entire Llama-1B forward pass into a single ‘megakernel,’ eliminating kernel boundaries. On an H100, this achieves 78% memory bandwidth utilization and outperforms existing systems by over 1.5x.”