Speeding up tensor operations by subselecting elements


I am performing a series of operations on large 1D torch tensors of the same shape to get several feature vectors. The feature vectors are then multiplied element-wise to get the desired result. I am utilizing the GPU. Example:

import torch

# large 1D torch tensors
a = torch.tensor([0.000001, 0.1, 1, 100, 1000]).to("cuda")
b = torch.tensor([0.000002, 0.2, 2, 200, 2000]).to("cuda")

# 1st feature computation
feature_a = a**2

# 2nd feature computation
feature_b = 1/(b**4)

# 3rd feature computation
feature_c = ...

# multiplying feature vectors to get the final result
result = torch.prod(torch.stack([feature_a, feature_b, feature_c]), dim=0)

The thing is, after computing feature_a, I can already see that its first element will be very close to 0, which in turn will cause the first element of result to be very close to 0.

Is there some way to look for such small values after calculating each feature and then tell PyTorch to skip those elements in the subsequent feature computations, spending time only on the remaining elements?

The question is not just how to do it, but whether it is worth doing. The aim is to speed the calculations up, so if the indexing operations would stall the otherwise smooth GPU pipeline, I will probably stick with the current procedure.


NOTE: I am not interested in the gradients, only the forward pass.
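For reference, one way to express the idea in the question is with a boolean mask: treat the first feature as the running product, and compute later features only for the indices that are still above some cutoff. The cutoff `eps` below is a hypothetical value, and the device fallback to CPU is only so the sketch runs anywhere:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# same example tensors as above
a = torch.tensor([0.000001, 0.1, 1.0, 100.0, 1000.0], device=device)
b = torch.tensor([0.000002, 0.2, 2.0, 200.0, 2000.0], device=device)

eps = 1e-8  # hypothetical cutoff below which an element is treated as zero

# 1st feature doubles as the running product
result = a ** 2

# only elements still above the cutoff get the 2nd feature computed
mask = result >= eps
result[mask] *= 1 / b[mask] ** 4

# elements that were skipped are clamped to exactly zero
result[~mask] = 0.0
```

Note that `b[mask]` is an advanced-indexing gather, which launches its own kernels; whether this pays off depends on how many elements the mask removes and how expensive each feature is.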

As so often, you would need to profile both approaches for your use case and check whether the masking operations end up more expensive than the unnecessary compute. Boolean indexing on the GPU is itself a gather/scatter with its own kernel launches, so unless the mask removes a large fraction of the elements (or the remaining features are expensive), the dense version often wins.
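A minimal profiling sketch, using `torch.utils.benchmark.Timer` (which handles warmup and CUDA synchronization for you). The sizes, the `eps` cutoff, and the two toy feature functions are assumptions for illustration, not the asker's real workload:

```python
import torch
import torch.utils.benchmark as benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"
n = 1_000_000
a = torch.rand(n, device=device)
b = torch.rand(n, device=device) + 0.1  # keep b away from 0 to avoid huge 1/b**4

def dense(a, b):
    # compute every feature on every element
    return a ** 2 * (1 / b ** 4)

def masked(a, b, eps=1e-4):
    # skip elements whose first feature is already below the cutoff
    result = a ** 2
    mask = result >= eps
    result[mask] *= 1 / b[mask] ** 4
    return result

t_dense = benchmark.Timer(stmt="dense(a, b)",
                          globals={"dense": dense, "a": a, "b": b})
t_masked = benchmark.Timer(stmt="masked(a, b)",
                           globals={"masked": masked, "a": a, "b": b})
print(t_dense.timeit(100))
print(t_masked.timeit(100))
```

With `torch.rand`, only about 1% of elements fall below `eps = 1e-4`, so on most hardware the masked version will be slower here; it only has a chance when the mask removes most of the work.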