Indexing a large tensor is very slow

Hey all, we have two tensors. The first is of size (S1, N) and the other is (N, S2). Both are quite large matrices, with S1 = ~200k, N = 10000, and S2 = ~200k.
We need to compute the matrix product of a few rows (a couple of tens) of the first tensor with the entire second tensor. The rows are given by a boolean mask, mask, and there are many such masks for which we need to calculate the product.
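For concreteness, here is a scaled-down sketch of the setup (sizes shrunk so it runs anywhere; the names S1, S2, and masks match the snippet below, everything else is illustrative):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
s1, n, s2 = 2000, 100, 2000  # stand-ins for ~200k, 10000, ~200k

S1 = torch.randn(s1, n, device=device)  # first matrix, (S1, N)
S2 = torch.randn(n, s2, device=device)  # second matrix, (N, S2)

masks = []
for _ in range(100):
    m = torch.zeros(s1, dtype=torch.bool, device=device)
    m[torch.randperm(s1, device=device)[:20]] = True  # ~20 selected rows per mask
    masks.append(m)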

Right now, we're doing:

for mask in masks:
    t = S1[mask]                  # boolean indexing gathers the selected rows into a new tensor
    result = torch.matmul(t, S2)  # multiply the selected rows with the full second matrix

But it’s very slow.
We found that indexing with a tensor actually copies the data, as opposed to indexing with an integer or a slice, which returns a view.
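For example, writing through the result makes the difference visible (a small standalone sketch; names are illustrative):

import torch

x = torch.randn(8, 4)

v = x[2:5]        # slicing returns a view into x's storage
v[0, 0] = 42.0
print(x[2, 0])    # 42.0 -- the write shows up in x, so no copy was made

mask = torch.zeros(8, dtype=torch.bool)
mask[2:5] = True
c = x[mask]       # boolean indexing allocates and fills a new tensor
c[0, 0] = -1.0
print(x[2, 0])    # still 42.0 -- the copy is independent of x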

We suspect that S1[mask] creates another tensor, and this copy seems to be the bottleneck since S1 is a pretty large matrix.

What are our options?
Sparse matrices? Writing custom CUDA code? Any other way?

Thanks!

I don't know if the actual bottleneck is the copy or the synchronization caused by using a BoolTensor as a mask, since you are adding data-dependent logic to your code (the output shape depends on the content of the mask, not just its metadata).
You could try to replace the mask with an index tensor and then profile your code again to see if these copies are actually as expensive as you think.
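Something like this (a minimal sketch assuming the setup above; note that the nonzero call is itself data-dependent, so do it once per mask ahead of time rather than inside the hot loop, and drop the synchronize calls if you're on the CPU; they are only there so the timings measure the GPU work):

import time
import torch

mask = masks[0]                       # any one of the masks from above
idx = mask.nonzero(as_tuple=True)[0]  # LongTensor of selected row indices; compute once and reuse

torch.cuda.synchronize()
t0 = time.perf_counter()
t = S1.index_select(0, idx)           # gather ~20 rows instead of masking all of S1
torch.cuda.synchronize()
t1 = time.perf_counter()
result = torch.matmul(t, S2)
torch.cuda.synchronize()
t2 = time.perf_counter()
print(f"gather: {(t1 - t0) * 1e3:.3f} ms, matmul: {(t2 - t1) * 1e3:.3f} ms")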

Hi bro, did you solve this problem?