Let's say I have a 2D tensor named x. I also have a Python list a of the form a = [a_1, a_2, …, a_n], where each a_i is a list of column indexes of x. For each a_i, I want to sum the corresponding columns of x as efficiently as possible on the GPU. To do this, I currently use:

new_tensor = torch.stack([torch.sum(x[:, i], 1) for i in a], 1)
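For concreteness, here is a minimal self-contained version of that line. The tensor size and the contents of a are made-up values for illustration only, and the code falls back to the CPU when no GPU is available.

```python
import torch

# Use the GPU when one is present; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical data: a 3 x 4 tensor and three groups of column indexes.
x = torch.arange(12, dtype=torch.float32, device=device).reshape(3, 4)
a = [[0, 1], [2], [1, 2, 3]]

# One indexing operation and one sum per group, driven by a Python loop.
# Each x[:, i] has shape (rows, len(i)); summing over dim 1 gives (rows,).
new_tensor = torch.stack([torch.sum(x[:, i], 1) for i in a], 1)

print(new_tensor.shape)  # torch.Size([3, 3]): one output column per group
```

The result has one row per row of x and one column per entry of a.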

However, even though x is already on the GPU, this takes far too long, and GPU utilization drops to 0% while it runs.
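One direction that might avoid the per-group Python loop is to flatten all the groups once and let a single index_add_ call do the accumulation. This is an unbenchmarked sketch, not a confirmed fix: the function name grouped_column_sums and the sample data are hypothetical, and it assumes every entry of every a_i is a valid column index of x.

```python
import torch

def grouped_column_sums(x, a):
    """Sum the columns of x listed in each a_i; returns shape (rows, len(a)).

    Hypothetical helper, assuming all indexes in a are valid columns of x.
    """
    # Flatten the groups once: the column index of every entry...
    cols = torch.tensor([c for group in a for c in group], device=x.device)
    # ...and the output column (group number) that entry contributes to.
    group_ids = torch.tensor(
        [g for g, group in enumerate(a) for _ in group], device=x.device
    )
    out = torch.zeros(x.shape[0], len(a), dtype=x.dtype, device=x.device)
    # Gather every requested column, then add each into its group's column.
    out.index_add_(1, group_ids, x[:, cols])
    return out

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical data, for illustration only.
x = torch.arange(12, dtype=torch.float32, device=device).reshape(3, 4)
a = [[0, 1], [2], [1, 2, 3]]

vectorized = grouped_column_sums(x, a)
looped = torch.stack([torch.sum(x[:, i], 1) for i in a], 1)
```

Here the two index tensors are built and copied to the device once, rather than once per group. Note that x[:, cols] materializes one column per entry across all a_i, so memory use grows with the total number of indexes.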

Any idea how to make it run faster on GPU?

Thanks!