Comparative performance of low-level operations

In PyTorch there are often multiple ways to perform the same simple operation, sometimes with very different computational costs.
A few examples (see the timing sketch after this list):
- indexing is faster than `index_select`
- indexing on a later dimension rather than the first dimension makes subsequent operations such as `torch.bucketize` slower, because the resulting tensor is no longer contiguous
- `torch.prod` is faster than multiplying indexed tensors
- `Tensor.expand` is faster than `Tensor.repeat` because `expand` does not copy data
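
A minimal timing sketch for two of these pairs, using `torch.utils.benchmark`; the shapes and sizes are arbitrary assumptions on my part, and the actual numbers will depend on shapes, dtype, and hardware:

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(10_000, 64)
idx = torch.randint(0, 10_000, (1_000,))

# Pairs of roughly equivalent operations from the list above.
stmts = {
    "advanced indexing": "x[idx]",
    "index_select":      "torch.index_select(x, 0, idx)",
    "expand (no copy)":  "x[:1].expand(10_000, 64)",
    "repeat (copies)":   "x[:1].repeat(10_000, 1)",
}

for label, stmt in stmts.items():
    t = benchmark.Timer(stmt=stmt, globals={"torch": torch, "x": x, "idx": idx})
    # blocked_autorange() repeats the statement enough times for a stable median
    print(f"{label:20s} {t.blocked_autorange().median * 1e6:8.1f} us")
```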

I was wondering: is there a comprehensive list of such low-level performance “tricks”? Also, can an implementation be faster than an alternative on the forward pass but slower on the backward pass? I am mainly interested in the comparative performance of low-level basic operations like the ones above; a sketch of how one might measure the forward/backward question follows.
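
One way to probe the forward-vs-backward question empirically is to time the forward pass alone and the forward+backward pass for each alternative. A hedged sketch, where the `torch.prod` vs. indexed-multiply pair and all shapes are illustrative assumptions, not measurements from the original post:

```python
import torch
import torch.utils.benchmark as benchmark

def f_prod(t):
    # product over the last dimension in a single reduction kernel
    return torch.prod(t, dim=1)

def f_mul(t):
    # equivalent result via two indexed views and an elementwise multiply
    return t[:, 0] * t[:, 1]

x = torch.randn(100_000, 2)
xg = torch.randn(100_000, 2, requires_grad=True)

for name, fn in [("torch.prod", f_prod), ("indexed multiply", f_mul)]:
    t_fwd = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    t_bwd = benchmark.Timer(
        stmt="fn(xg).sum().backward()",  # builds a fresh graph each run
        globals={"fn": fn, "xg": xg},
    )
    print(f"{name:18s} fwd {t_fwd.blocked_autorange().median * 1e6:7.1f} us   "
          f"fwd+bwd {t_bwd.blocked_autorange().median * 1e6:7.1f} us")
```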
