Implement elementwise "ufunc" in C++/CUDA extension

I have a C++ implementation of a “special” function (an elliptic integral) that I can also modify into a cuda kernel if need be. What is the recommended way to code up an extension that applies it elementwise over tensors with automatic broadcasting, type upcasting, and CPU (parallelized) / cuda dispatch like a builtin function?

The extension tutorial shows something slightly more complicated (not elementwise) but explicitly cannot handle mixed types or different sizes. And there’s some examples of automatic mixed precision and CPU/cuda dispatch in another tutorial, again assuming equal sizes though. In fact, a similar question (concerning only broadcast) was asked before but never answered. But this seems to me like a ubiquitous task for builtin functions, so I’m hoping there’s an easy way to do it.

Maybe you could use the TensorIterator as it’s also used in e.g. UnaryOps.cpp and might fit your use case.