Libtorch + CUDA kernels

I’m writing a module using libtorch with lots of point-wise operations (a physics engine).
It runs quite slowly on CUDA, and as I understand it, I need to use custom kernels.
Is there an ‘official’ way to do this?
I saw this tutorial, but the information relates only to the Python front end.

Is there a reason you must use libtorch? The typical recommendation for PyTorch 2.0+ is to use torch.compile for workloads with many pointwise operations that are amenable to, e.g., operator fusion: torch.compile Tutorial — PyTorch Tutorials 2.0.1+cu117 documentation