How to preserve autograd of a tensor after calling .detach() and processing it?

I would suggest asking for help optimizing the nested loop rather than detaching. If the loop really can't be optimized, you can write a custom C++ extension (properly done, with its own backward pass). You can still "run" the detached code, but keep in mind that PyTorch won't be aware of any of those operations, and thus will propagate wrong gradients…
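If a custom C++ extension is overkill, the same idea works in pure Python with `torch.autograd.Function`: you do the untracked processing in `forward` and supply the gradient yourself in `backward`. A minimal sketch (the doubling op and class name are made-up placeholders for whatever your nested loop computes):

```python
import torch

class MyLoopOp(torch.autograd.Function):
    # Hypothetical op standing in for your nested loop: y = 2 * x.
    @staticmethod
    def forward(ctx, x):
        # Work on a detached copy; autograd does NOT record anything here,
        # so arbitrary Python loops are fine.
        y = x.detach() * 2
        return y

    @staticmethod
    def backward(ctx, grad_output):
        # You must provide the gradient manually: d(2x)/dx = 2.
        # If backward is wrong, autograd will silently use it anyway.
        return grad_output * 2

x = torch.randn(3, requires_grad=True)
y = MyLoopOp.apply(x)
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
```

The catch is the same one mentioned above: autograd trusts whatever `backward` returns, so if you can't write the analytic gradient of your loop, this approach won't save you.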

Lastly, just some random links:

And I know there is a library that generates CUDA kernels to perform optimized operations for given code, speeding it up, but I forgot the name :confused: