Hello. I need to implement a functional that involves four nested for loops over a tensor. Brute-force looping over the tensor in Python is too slow. Is there a way to write parallel CUDA code that operates directly on a PyTorch tensor that is already on the GPU?
Specifically, is there a way, as in C or C++ CUDA, to define block and thread dimensions and then parallelize the for loops across many CUDA cores?
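For context, this is roughly the kind of kernel I have in mind (a sketch only — the kernel name and the per-element operation are placeholders; it assumes a contiguous float tensor whose four loop dimensions are flattened into `total` elements):

```cuda
// Hypothetical kernel: one thread per tensor element, replacing the four
// nested Python loops. `in`/`out` would come from tensor.data_ptr<float>().
__global__ void my_functional_kernel(const float* in, float* out, long total) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < total) {
        out[i] = in[i] * in[i];  // placeholder for the real per-element computation
    }
}

// Host-side launch: choose block/thread dims so every element gets a thread.
void launch(const float* d_in, float* d_out, long total) {
    int threads = 256;
    long blocks = (total + threads - 1) / threads;
    my_functional_kernel<<<blocks, threads>>>(d_in, d_out, total);
}
```

I believe something like `torch.utils.cpp_extension.load_inline` can bind such a kernel so it is callable from Python on GPU tensors, but I am not sure whether that is the recommended approach here.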