For example, I have a 400×800 tensor A with about 10 non-zero elements in each row, and I want the product of the non-zero elements of each row. Here’s how I do it:
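import torch
A[A == 0] = 1  # overwrite zeros in place with 1 so they don't affect the product
result = torch.prod(A, dim=1)  # product along each row, shape (400,)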
I suspect this is inefficient because most of the multiplications are redundant: roughly 790 of the 800 factors in each row are just 1. Is there a faster way to do this on the GPU?
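For anyone who wants to reproduce this, here is a minimal self-contained sketch of the approach described above. It uses `torch.where` instead of the in-place assignment so that `A` is left unmodified. The function name `rowwise_nonzero_prod` and the random test data are my own assumptions for illustration; only the 400×800 shape and the roughly 10 non-zeros per row come from my actual setup.

```python
import torch


def rowwise_nonzero_prod(A: torch.Tensor) -> torch.Tensor:
    """Product of the non-zero entries in each row (an all-zero row gives 1)."""
    # Swap zeros for the multiplicative identity without modifying A itself.
    ones = torch.ones_like(A)
    return torch.prod(torch.where(A == 0, ones, A), dim=1)


# Falls back to the CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical test data: 400x800 with exactly 10 non-zero entries per row.
A = torch.zeros(400, 800, device=device)
# argsort of random values gives a random permutation per row; keep 10 distinct columns.
cols = torch.rand(400, 800, device=device).argsort(dim=1)[:, :10]
# Values in [0.5, 1.5), so none of the scattered entries is zero.
A.scatter_(1, cols, torch.rand(400, 10, device=device) + 0.5)

result = rowwise_nonzero_prod(A)  # shape (400,)
```

This still multiplies all 800 entries per row, so it only restates the baseline I am trying to beat; it is not meant as a faster alternative.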