I have a list of tensors, all of which are on the GPU. I obtained this list by splitting one GPU tensor using torch.split. I want a list of sums, one per tensor: the first element should be the sum of the first tensor in the list, the second the sum of the second tensor, and so on. If I run a for loop for this, does it get parallelized? If not, is there a way to make it run in parallel? The list is pretty long, and each sum can be computed independently of the others, so if this could be parallelized on the GPU, the performance gain would be immense.
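For concreteness, here is a minimal sketch of what I mean. The tensor shape and chunk size are made up for illustration; the reshape-and-sum at the end is one vectorized alternative I'm considering, which only works when the chunks are equal-sized:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, 256, device=device)  # arbitrary example tensor

# What I do now: split, then sum each chunk in a Python loop.
# Each .sum() launches its own GPU kernel sequentially.
chunks = torch.split(x, 10, dim=0)          # 100 chunks of shape (10, 256)
sums_loop = [c.sum() for c in chunks]       # list of 100 scalar tensors

# Possible alternative for equal-sized chunks: reshape so each chunk
# becomes one row, then reduce over that row in a single kernel.
sums_vec = x.reshape(100, -1).sum(dim=1)    # shape (100,)

# Both give the same result (up to floating-point reduction order).
assert torch.allclose(torch.stack(sums_loop), sums_vec)
```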
I have opened a similar post on SO as well, where I've given my use case with an example. You may want to check that out. Link