Grad called on fixed distance leaves scales with graph size?

calling grad is a bottleneck in my program and so I am investigating why it is slow.

I call grad on a pair of variables (scalar and 1D tensor) from the current and previous iteration, respectively. The way I have written it (see below), I find the grad call scales roughly logarithmically with the iteration, even though it only needs gradient information from the current and previous iteration and so I assumed would not scale at all with iteration.

Please show me how I can avoid this scaling. An explanation of the scaling would also be appreciated.

Here is a MWE. I use time.time to clock, but got similar results with %%timeit.
Note that because grad only works on leaves of the graph, I got it working by storing the tensors in a list, which somehow makes them leaves, or at least avoids grad complaining. If you know of a cleaner way to do this that doesn’t require the list, I would be happy to know. Thanks!

import time
import torch

N = 10000

def step_forward(x):
    return x+1

for num_steps in [100, 1000, 10000]:
    x = torch.ones([N])
    store_x = list()
    for i in range(num_steps):
        x = step_forward(x)
    st = time.time()
    torch.autograd.grad(store_x[i][0],store_x[i-1],retain_graph = True)[0]
    print('T='+str(num_steps)+' took '+str(time.time() - st))

Running on my machine gives the output:

T=100 took 0.00034737586975097656
T=1000 took 0.000988006591796875
T=10000 took 0.015600204467773438