When a tensor is multiplied or divided by 1, or has 0 added to or subtracted from it, we expect it to remain unchanged. So the most intuitive implementation appears to be:

```
def my_add(a, b):
    if is_scalar(b) and b == 0:
        return a
    if is_scalar(a) and a == 0:
        return b
    return a + b
```

for some type-checking function `is_scalar`.
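As a concrete sketch (the `is_scalar` helper here is a hypothetical stand-in, implemented with `numbers.Number`), the short-circuit returns the other operand untouched instead of launching an element-wise kernel:

```python
import numbers

def is_scalar(v):
    # Treat plain Python numbers as scalars; tensors/arrays are not.
    return isinstance(v, numbers.Number)

def my_add(a, b):
    # Adding scalar 0 cannot change the other operand, so return it
    # directly rather than computing an element-wise sum.
    if is_scalar(b) and b == 0:
        return a
    if is_scalar(a) and a == 0:
        return b
    return a + b
```

Note that the short-circuit returns the *same* object, not a copy, so the result aliases the input's memory; that aliasing is one semantic difference a library would have to account for.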

In other words, these operations could be short-circuited instead of performing an element-wise addition of 0 on every element of the tensor. However, neither NumPy nor PyTorch does this, which can be verified by the following script:

```
import torch
from timeit import timeit

n, c, h, w = 64, 3, 128, 128
arr = torch.rand(n, c, h, w, device="cuda")
t0 = timeit("arr", globals=globals(), number=1000)          # name lookup only
t1 = timeit("arr.clone()", globals=globals(), number=1000)  # explicit copy
t2 = timeit("arr + 0", globals=globals(), number=1000)      # element-wise add of 0
# t0 < t1 < t2: adding 0 is no cheaper than an explicit copy
```
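The same behavior is easy to confirm on the CPU with NumPy, without needing a GPU (variable names here are my own; timings will vary by machine, but the ordering holds because `arr + 0` allocates and fills a whole new array):

```python
import numpy as np
from timeit import timeit

arr = np.random.rand(64, 3, 128, 128)
t_noop = timeit("arr", globals=globals(), number=100)        # name lookup only
t_copy = timeit("arr.copy()", globals=globals(), number=100) # explicit copy
t_add0 = timeit("arr + 0", globals=globals(), number=100)    # element-wise add of 0
# arr + 0 returns a fresh array rather than short-circuiting:
assert not np.shares_memory(arr, arr + 0)
```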

This can be especially costly when a user inadvertently uses code like the following as a basic building block in a network that processes very large batches, thinking it is just an innocent summation:

```
def func(x: torch.Tensor):
    y = 0
    for i in range(n):
        y += do_something(x)  # first iteration computes 0 + tensor
    return y
```
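A minimal workaround, assuming `do_something` is the (hypothetical) per-step computation, is to seed the accumulator with the first term so the `0 + tensor` operation never happens:

```python
def func_no_zero(x, n, do_something):
    # Seed the accumulator with the first term instead of the integer 0,
    # so no "0 + tensor" kernel (or extra autograd node) is ever created.
    y = do_something(x)
    for _ in range(n - 1):
        y = y + do_something(x)
    return y
```

This works for plain numbers as well as tensors, since it only relies on `+`.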

I see no obvious obstacle to such short-circuiting (operator overloads, autograd graphs, etc.), but I couldn't find any discussion of this topic at all. Perhaps the improvement is too small for most users? I personally work on edge computing, where every bit of time matters.