What is the backend for torch einsum on GPU? Does it use a compiler like TC or TVM?
No — currently it lowers to `torch.bmm`, not a compiler like TC or TVM.
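To illustrate (a minimal sketch, not a dump of the actual dispatch code): a batched contraction written with `einsum` produces the same result as the explicit `torch.bmm` call it reduces to.

```python
import torch

# Batched matrix multiply expressed two ways: the einsum spelling,
# and the explicit bmm it is internally reduced to.
a = torch.randn(8, 3, 4)
b = torch.randn(8, 4, 5)

out_einsum = torch.einsum("bij,bjk->bik", a, b)  # batched contraction
out_bmm = torch.bmm(a, b)                        # explicit batched matmul

print(torch.allclose(out_einsum, out_bmm))  # True
```

More general einsum expressions are handled by permuting/reshaping the operands into a form that a (batched) matmul can consume.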
I do have a branch implementing the reductions via TensorIterator, but I didn’t benchmark it (probably not terrible on CUDA for small problems, less so for …).
Hacking a mini-JIT pass and a custom op for KeOps or some such would probably be a nice quick project (and, more generally, a fuser targeting KeOps would be a nice, somewhat larger project). For TVM you would have to figure out how you want to optimize (i.e. benchmark on the first invocation and expect that subsequent calls will have similar shapes, …).