Computing multiple sums over columns efficiently on the GPU

Let’s say I have a 2d tensor named x. I also have a Python list a of the form a = [a_1, a_2, …, a_n], where each a_i is a list of column indices of the tensor x. For each a_i, I want to compute the sum of the corresponding columns of x as efficiently as possible on the GPU. To do that, I use:

new_tensor = torch.stack([torch.sum(x[:, i], 1) for i in a], 1)

However, even though the tensor x is already on the GPU, this takes far too long, and GPU utilization drops to 0% during execution.

Any idea how to make it run faster on GPU?

Thanks!

Would it be possible to create an index tensor from a and index x only once instead of in a loop?
Could you print the shapes and values for some dummy inputs of x and a?

x could be x = torch.rand(5, 5) (a 5 x 5 random tensor), and a = [[0, 1], [1, 2, 3], [4]] (in general, the entries of a have different lengths).
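
Concretely, with those toy inputs the current code looks like this (torch.rand here is just a stand-in for my real data):

import torch

x = torch.rand(5, 5)          # 5 x 5 random tensor (stand-in for the real data)
a = [[0, 1], [1, 2, 3], [4]]  # entries of a have different lengths

new_tensor = torch.stack([torch.sum(x[:, i], 1) for i in a], 1)
print(x.shape)           # torch.Size([5, 5])
print(new_tensor.shape)  # torch.Size([5, 3]), one summed column per a_i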

I don’t follow the “only once” part you mention above.
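
In case it helps, one possible reading of the “only once” suggestion (an interpretation, not necessarily what was meant) is to flatten a into a single column-index tensor plus a group id per index, gather all needed columns of x in one indexing call, and reduce per group with index_add_:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.rand(5, 5, device=device)
a = [[0, 1], [1, 2, 3], [4]]

# Flatten a into one column-index tensor and a matching group-id tensor
cols = torch.tensor([c for group in a for c in group], device=device)                  # [0, 1, 1, 2, 3, 4]
groups = torch.tensor([g for g, group in enumerate(a) for _ in group], device=device)  # [0, 0, 1, 1, 1, 2]

# Index x only once, then accumulate each gathered column into its group's output column
out = torch.zeros(x.size(0), len(a), device=device)
out.index_add_(1, groups, x[:, cols])

# Same result as the loop-based version
reference = torch.stack([torch.sum(x[:, i], 1) for i in a], 1)
print(torch.allclose(out, reference))  # True

An equivalent alternative is to build a 0/1 indicator matrix M of shape (x.size(1), len(a)) with M[j, i] = 1 when column j belongs to a_i and compute x @ M; either way, the Python loop over a disappears.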