Short-circuit multiplication by 1 and addition by 0

When a tensor is multiplied or divided by 1, or has 0 added to or subtracted from it, we expect the result to be unchanged. So the most intuitive implementation of, say, addition appears to be:

def my_add(a, b):
    # If one operand is the scalar 0, return the other operand untouched
    # instead of launching an element-wise addition.
    if is_scalar(b) and b == 0:
        return a
    if is_scalar(a) and a == 0:
        return b
    return a + b

for some type-checking function is_scalar.
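
For illustration, is_scalar could be something as simple as the sketch below. This is only a minimal version; a real short-circuit would also have to respect dtype promotion and broadcasting so that the returned tensor matches exactly what a + b would produce.

import numbers
import torch

def is_scalar(x) -> bool:
    # Treat plain Python numbers and zero-dimensional tensors as scalars.
    return isinstance(x, numbers.Number) or (
        isinstance(x, torch.Tensor) and x.dim() == 0
    )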

In other words, we could short-circuit these operations instead of performing an element-wise addition of 0 on every element of the tensor. However, neither NumPy nor PyTorch does this, which can be verified with the following script:

import torch
from timeit import timeit

n, c, h, w = 64, 3, 128, 128
arr = torch.rand(n, c, h, w, device="cuda")
t0 = timeit("arr", globals=globals(), number=1000)          # baseline: just evaluate the name
t1 = timeit("arr.clone()", globals=globals(), number=1000)  # a full copy
t2 = timeit("arr + 0", globals=globals(), number=1000)      # the "no-op" addition
# Observed ordering: t0 < t1 < t2
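
Since CUDA kernels launch asynchronously, timeit as written mostly measures launch overhead; a variant that synchronizes inside the timed statement gives a closer estimate of the actual kernel cost. A sketch with the same setup as above:

import torch
from timeit import timeit

arr = torch.rand(64, 3, 128, 128, device="cuda")
sync = "; torch.cuda.synchronize()"
# Synchronizing inside the timed statement includes kernel execution time,
# not just the asynchronous launch.
t_clone = timeit("arr.clone()" + sync, globals=globals(), number=1000)
t_add0  = timeit("arr + 0" + sync, globals=globals(), number=1000)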

This can be especially costly when a user inadvertently uses code like the following as a basic building block in a network that processes very large batches, thinking it is just an innocent summation:

def func(x: torch.Tensor):
    y = 0                        # starts as a Python int
    for i in range(n):           # n and do_something defined elsewhere
        y += do_something(x)     # first iteration computes 0 + tensor: a full element-wise add
    return y
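
Until the framework short-circuits this itself, the extra kernel can be avoided on the user side by seeding the accumulator with the first term rather than with the Python integer 0. A minimal sketch, assuming n >= 1 and that do_something returns a fresh tensor on each call:

def func_no_zero(x: torch.Tensor):
    y = do_something(x)        # bind the first term directly; no 0 + tensor kernel
    for _ in range(n - 1):
        y += do_something(x)   # in-place accumulation as before
    return y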

I see no reason against such short-circuiting (for the operators themselves, for autograd graphs, etc.), but I couldn't find any discussion of this topic. Perhaps the improvement is too small for most users? I personally work on edge computing, so every bit of saved time matters.
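
For example, the redundant add is also recorded in the autograd graph, which is easy to check:

import torch

x = torch.rand(4, requires_grad=True)
y = x + 0
print(y.grad_fn)   # AddBackward0: the no-op addition still creates a backward node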