When a tensor is multiplied or divided by 1, or has 0 added to or subtracted from it, we expect the result to be unchanged. So the most intuitive implementation of, say, addition appears to be:
```python
def my_add(a, b):
    if is_scalar(b) and b == 0:
        return a
    if is_scalar(a) and a == 0:
        return b
    return a + b
```

for some suitable type-checking function `is_scalar`.
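To make the idea concrete, here is a minimal NumPy sketch (the `is_scalar` helper is one possible definition, not an existing library function): the short-circuited version returns the original array object untouched, while a non-zero operand falls through to the normal element-wise path.

```python
import numpy as np

def is_scalar(x):
    # Treat plain Python numbers and NumPy scalars as scalars.
    return isinstance(x, (int, float, complex)) or np.isscalar(x)

def my_add(a, b):
    # Short-circuit: adding scalar zero returns the other operand as-is,
    # skipping both the element-wise kernel and the output allocation.
    if is_scalar(b) and b == 0:
        return a
    if is_scalar(a) and a == 0:
        return b
    return a + b

arr = np.random.rand(4, 4)
assert my_add(arr, 0) is arr                    # same object, no copy
assert my_add(0, arr) is arr
assert np.array_equal(my_add(arr, 1), arr + 1)  # non-zero falls through
```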
In other words, we could short-circuit these operations instead of performing an element-wise addition of 0 across the whole tensor. However, neither NumPy nor PyTorch does this, which can be verified with the following script:
```python
import torch
from timeit import timeit

n, c, h, w = 64, 3, 128, 128
arr = torch.rand(n, c, h, w, device="cuda")

t0 = timeit("arr", globals=globals(), number=1000)
t1 = timeit("arr.clone()", globals=globals(), number=1000)
t2 = timeit("arr + 0", globals=globals(), number=1000)
# t0 < t1 < t2
```
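The same holds for NumPy on the CPU. A small sketch (array shape and repetition count are arbitrary choices for illustration):

```python
from timeit import timeit
import numpy as np

arr = np.random.rand(64, 3, 128, 128)

t_noop = timeit("arr", globals=globals(), number=100)        # name lookup only
t_copy = timeit("arr.copy()", globals=globals(), number=100) # allocation + memcpy
t_add0 = timeit("arr + 0", globals=globals(), number=100)    # full element-wise pass

# `arr + 0` pays for an output allocation plus an element-wise pass,
# so it is never free even though the result equals `arr`.
print(t_noop, t_copy, t_add0)
```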
This can be especially costly when a user inadvertently uses the following pattern as a basic building block in a network that processes very large batches, thinking it is just an innocent summation:
```python
def func(x: torch.Tensor):
    y = 0
    for i in range(n):
        y += do_something(x)
    return y
```
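A manual workaround for this pattern (a sketch; `do_something` is a stand-in for the real per-step computation): start the accumulator at `None` so the first iteration simply binds the tensor instead of computing `0 + tensor`.

```python
import torch

def do_something(x):
    # placeholder for the real per-step computation
    return x * 2

def func(x: torch.Tensor, n: int = 4):
    y = None
    for _ in range(n):
        t = do_something(x)
        # First iteration: bind the fresh tensor directly;
        # later iterations: accumulate in place (safe because t is fresh).
        y = t if y is None else y.add_(t)
    return y

x = torch.ones(3)
assert torch.equal(func(x, 4), torch.full((3,), 8.0))
```

Note that in-place `add_` is only appropriate here because each `t` is a freshly allocated tensor; under autograd, a plain `y + t` would be the safer choice.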
I see no reason such short-circuiting couldn't be implemented (in the operators, autograd graph construction, etc.), but I couldn't find any discussion of this topic at all. Perhaps the improvement is too small for most users? I personally work on edge computing, where every bit of time matters.