Hi,
Calling grad is a bottleneck in my program, so I am investigating why it is slow.
I call grad on a pair of variables (a scalar from the current iteration and a 1D tensor from the previous one). The way I have written it (see below), the time for the grad call grows roughly linearly with the iteration count (see the timings below), even though it only needs gradient information from the current and previous iteration, so I assumed it would not scale with iteration at all.
Please show me how I can avoid this scaling. An explanation of the scaling would also be appreciated.
Here is an MWE. I use time.time() to clock it, but I got similar results with %%timeit.
Note that because grad complained when I called it on non-leaf tensors directly, I got it working by storing the tensors in a list and calling requires_grad_() on each, which somehow makes them leaves, or at least stops grad from complaining. If you know of a cleaner way that doesn't require the list, I would be happy to hear it. Thanks!
import time

import torch

N = 10000

def step_forward(x):
    return x + 1

for num_steps in [100, 1000, 10000]:
    x = torch.ones([N])
    store_x = list()
    store_x.append(x)
    store_x[-1].requires_grad_()
    for i in range(num_steps):
        x = step_forward(x)
        store_x.append(x)
        store_x[-1].requires_grad_()
    st = time.time()
    # gradient of the current state's first element w.r.t. the previous state
    torch.autograd.grad(store_x[-1][0], store_x[-2], retain_graph=True)[0]
    print('T=' + str(num_steps) + ' took ' + str(time.time() - st))
Running on my machine gives the output:
T=100 took 0.00034737586975097656
T=1000 took 0.000988006591796875
T=10000 took 0.015600204467773438
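In case it helps frame the question: the only workaround I can think of is to detach the stored tensor at each step, so that the graph never spans more than one step. A sketch (grad_prev_step is just a name I made up, and I have not verified this is sound for my real use case, since it discards the earlier history):

```python
import torch

def step_forward(x):
    return x + 1

def grad_prev_step(x_prev):
    # Detach so x_prev becomes a fresh leaf: the graph built below
    # spans only this single step, no matter how many steps came before.
    x_prev = x_prev.detach().requires_grad_()
    x_next = step_forward(x_prev)
    # Gradient of the first element of the new state w.r.t. the previous state.
    g = torch.autograd.grad(x_next[0], x_prev)[0]
    return x_next, g

x = torch.ones(10000)
for _ in range(3):
    x, g = grad_prev_step(x)
# Since step_forward is elementwise, g is 1 at index 0 and 0 elsewhere.
```

If I understand the graph mechanics correctly, this would make each grad call constant-time, but it throws away the rest of the history, which I may still need, so I would still appreciate an explanation of the scaling.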