How to write parallel code in pytorch?

Hello. I need to implement a function that involves four nested for loops. Brute-force looping over the tensor in Python is too slow. Is there a way to write parallel CUDA code that operates directly on a PyTorch tensor that is already on the GPU?
Something like C or C++ CUDA, where we define the block and thread dimensions and then parallelize the loops across multiple CUDA cores?
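Before reaching for a custom kernel, it is worth checking whether the four loops can be expressed with broadcasting or `einsum`, which already runs in parallel on the GPU. A minimal sketch of the idea (a hypothetical quadruple-loop reduction, shown with NumPy for brevity; `torch.einsum` accepts the same subscript string and works on CUDA tensors):

```python
import numpy as np

# Hypothetical example: out[i, k] = sum over j, l of a[i, j] * b[k, l],
# which a naive implementation would write as four nested loops.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 5))
b = rng.standard_normal((6, 7))

# Naive quadruple loop -- slow in pure Python.
out_loop = np.zeros((4, 6))
for i in range(4):
    for k in range(6):
        s = 0.0
        for j in range(5):
            for l in range(7):
                s += a[i, j] * b[k, l]
        out_loop[i, k] = s

# The same computation as a single einsum call; on a CUDA tensor,
# torch.einsum('ij,kl->ik', a, b) dispatches to parallel GPU kernels.
out_vec = np.einsum('ij,kl->ik', a, b)
assert np.allclose(out_loop, out_vec)
```

Whether this applies depends on the actual loop body, of course; if the computation has data-dependent control flow per element, a custom kernel may still be needed.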

You can use your own CUDA kernels inline if you want, using CuPy. See the code in the pyinn package for an example.
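A minimal sketch of what that can look like, assuming CuPy and a CUDA-capable GPU are available. The kernel name `scale_add` and the sizes here are made up for illustration; `cp.RawKernel` compiles a raw CUDA C string, and `cp.asarray` can wrap a PyTorch CUDA tensor zero-copy via the `__cuda_array_interface__` protocol:

```python
import cupy as cp
import torch

# Raw CUDA kernel: one thread per element, indexed from
# blockIdx/blockDim/threadIdx exactly as in a C++ CUDA program.
scale_add = cp.RawKernel(r'''
extern "C" __global__
void scale_add(const float* x, float* y, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = alpha * x[i] + y[i];
    }
}
''', 'scale_add')

x_t = torch.arange(1024, dtype=torch.float32, device='cuda')
y_t = torch.ones(1024, dtype=torch.float32, device='cuda')

# Zero-copy CuPy views of the PyTorch CUDA tensors: the kernel writes
# into the same GPU buffer that y_t sees.
x_c = cp.asarray(x_t)
y_c = cp.asarray(y_t)

n = x_t.numel()
threads = 256                          # block dimension
blocks = (n + threads - 1) // threads  # grid dimension
scale_add((blocks,), (threads,), (x_c, y_c, cp.float32(2.0), cp.int32(n)))
# y_t now holds 2*x + 1, updated in place.
```

Nowadays PyTorch also ships `torch.utils.cpp_extension` for JIT-compiling C++/CUDA extensions directly, which is another route if you prefer to stay inside PyTorch.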