Does my for loop run in parallel if all the tensors involved in the loop are on the GPU?

I have a list of tensors, all of which are on the GPU; I obtained the list by splitting a single GPU tensor with torch.split. I want to compute the sum of each tensor in the list, i.e. a list whose first element is the sum of the first tensor, the second element is the sum of the second tensor, and so on. If I run a for loop over the list, do the sums get parallelised? If not, is there a way to run them in parallel? The list is quite long, and the sum of each tensor can be computed independently of the others, so if this could be done in parallel on the GPU the performance gain would be substantial.
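For concreteness, here is a minimal sketch of the use case. The tensor shapes and split size are made up for illustration, and it assumes the chunks produced by torch.split are all the same size, so the per-chunk sums can also be expressed as a single reduction over a reshaped view instead of a Python loop:

```python
import torch

# Hypothetical setup: one big tensor on the GPU, split into equal-sized chunks.
device = "cuda" if torch.cuda.is_available() else "cpu"
big = torch.randn(1000, 128, device=device)
chunks = torch.split(big, 10, dim=0)  # 100 chunks, each of shape (10, 128)

# Loop version: each .sum() launches its own CUDA kernel. The launches are
# asynchronous with respect to the CPU, but the kernels still execute one
# after another on the default stream.
sums_loop = [c.sum() for c in chunks]

# Vectorized version (only valid for equal-sized chunks): view each chunk as
# one row, then reduce over the remaining dimension in a single kernel.
sums_vec = big.view(len(chunks), -1).sum(dim=1)

# Both approaches produce the same per-chunk sums (up to float rounding).
assert torch.allclose(torch.stack(sums_loop), sums_vec, atol=1e-4)
```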

I have also opened a similar question on Stack Overflow, where I describe my use case with an example; you may want to check that out: Link

I also created a related topic here: Link