How do we write CUDA kernels without writing CUDA today?

Tensor Comprehensions has been discontinued (rather quietly). Triton seems too complicated? Or maybe not; if you have experience with it, please share :slight_smile:
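For context, here is roughly what a minimal Triton kernel looks like (a vector add, following the pattern from Triton's tutorials; the function and variable names here are my own):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                     # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                     # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

That is certainly less machinery than raw CUDA, though still lower-level than TC's einsum-like notation.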

So what do we use today in place of TC? Is anything on the horizon, even?

Perhaps some documentation on guiding the JIT towards performance similar to hand-written fused kernels?

There is indeed ongoing work in the different JIT backends, which use code generation to create (fused) kernels. I'm somewhat familiar with the internals of the nvfuser work, but unfortunately cannot link to proper documentation, as it's still at an early stage.
In any case, I expect to see more code-generation approaches in the future, which should make writing custom operations in the framework easier :wink:
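In the meantime, you can already inspect what the fusers produced by looking at the optimized graph after a couple of warm-up runs. A minimal sketch, assuming a CUDA device and the default profiling executor; fused sections show up as nodes such as prim::CudaFusionGroup (nvfuser) or prim::FusionGroup (the legacy fuser):

```python
import torch

@torch.jit.script
def pointwise_chain(x):
    # chains of pointwise ops like this are candidates for kernel fusion
    return torch.relu(x) * torch.sigmoid(x) + 1.0

x = torch.randn(1024, device='cuda')
pointwise_chain(x)  # warm-up runs let the profiling executor specialize
pointwise_chain(x)
print(pointwise_chain.graph_for(x))  # look for prim::*FusionGroup nodes
```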


So the current plan is JIT all the way? No lower-level scripting language specifically for kernels (like TC)?

And if it is going to be JIT-based, are there any plans for (or work on) a static diagnostic that tells you how well certain parts of the source TorchScript managed to generate fused low-level code?

Something like a function summarize_compilation(script(model, method)) that returns a report on which sections fused and which didn't. Or something along those lines. You can definitely tell I have no idea about this area except that "fused -> probably good" :sweat_smile:
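To make the idea concrete, I imagine something crude like the sketch below, which just walks the optimized graph after warm-up runs and counts fusion-group nodes. The helper name and report format are entirely hypothetical, and it only looks at top-level nodes:

```python
import torch

def summarize_compilation(scripted_fn, *example_inputs):
    # Purely hypothetical helper: run the scripted function a few times so
    # the profiling executor can optimize it, then count fused nodes in the
    # resulting graph (top-level nodes only; subblocks are not recursed).
    for _ in range(3):
        scripted_fn(*example_inputs)
    graph = scripted_fn.graph_for(*example_inputs)
    nodes = list(graph.nodes())
    fused = [n for n in nodes
             if 'FusionGroup' in n.kind() or 'TensorExprGroup' in n.kind()]
    print(f'{len(fused)} fusion group(s) among {len(nodes)} top-level nodes')
    for n in fused:
        print('  fused:', n.kind())
```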