I’m writing a module using libtorch with lots of point-wise operations (a physics engine).
It runs quite slowly on CUDA, and as I understand it, I need to write custom kernels.
Is there an ‘official’ way to do this?
I saw this tutorial, but the information relates only to the Python front end.
Is there a reason you must use libtorch? The typical recommendation for PyTorch 2.0+ is to use torch.compile for workloads that have many pointwise operations amenable to, e.g., operator fusion: torch.compile Tutorial — PyTorch Tutorials 2.0.1+cu117 documentation