I am looking to write a custom kernel for the backward pass in my model to speed up training. The tutorial posted on the PyTorch website goes into detail on writing C++ and CUDA code:
https://pytorch.org/tutorials/advanced/cpp_extension.html
Given that Numba makes writing CUDA kernels pretty straightforward, would there be any performance hurdle in doing it that way instead? It seems like data on the GPU can be passed between Numba and PyTorch without much overhead, since PyTorch tensors expose `__cuda_array_interface__`.
Has anyone tried this? Does anyone have some words of warning against doing so?
Thanks!