I think the PR for repeat_interleave that you can find here is a very good example. You would need to change exactly the same files to add the binding, then change the implementation in the C++ and CUDA code and add a test.
Note that for the interface, I think an extra argument like reduce="sum" that accepts "sum" and "max" would be good. You could also add "mean" for convenience, even though it can be computed by dividing the "sum" result by the lengths tensor.
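
To make the suggested interface concrete, here is a rough pure-Python reference of what I have in mind. It is only a sketch: the name `segment_reduce_reference` and the flat `values` / `lengths` lists are placeholders I made up for illustration, not the actual op or its signature, and it assumes every segment has length at least 1.

```python
def segment_reduce_reference(values, lengths, reduce="sum"):
    """Reduce consecutive segments of `values`; segment i has lengths[i] items.

    Hypothetical helper for illustration only, not an existing PyTorch function.
    """
    if reduce not in ("sum", "max", "mean"):
        raise ValueError(f"unsupported reduce mode: {reduce!r}")
    if sum(lengths) != len(values):
        raise ValueError("lengths must add up to len(values)")

    out = []
    start = 0
    for n in lengths:
        if n <= 0:
            # Assumption of this sketch: empty segments are not handled.
            raise ValueError("every segment must contain at least one value")
        segment = values[start:start + n]
        start += n
        if reduce == "max":
            out.append(max(segment))
        elif reduce == "sum":
            out.append(sum(segment))
        else:
            # "mean" is just the "sum" result divided by the segment length.
            out.append(sum(segment) / n)
    return out


values = [1, 2, 3, 4, 5, 6]
lengths = [2, 1, 3]  # segments: [1, 2], [3], [4, 5, 6]
sums = segment_reduce_reference(values, lengths, reduce="sum")
maxes = segment_reduce_reference(values, lengths, reduce="max")
means = segment_reduce_reference(values, lengths, reduce="mean")
```

The "mean" branch is the same as dividing the "sum" output elementwise by `lengths`, which is why it is only a convenience. A test for the real op could compare the C++ and CUDA results against a simple reference like this one.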