Calling torch.autograd.grad is a bottleneck in my program, so I am investigating why it is slow.
I call grad on a pair of variables: a scalar from the current iteration and a 1D tensor from the previous iteration. The way I have written it (see below), the time of the grad call grows with the number of iterations that have been run, even though it only needs gradient information from the current and previous iteration (each step depends only on the state of the step before it), so I assumed it would not scale with the iteration count at all.
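For concreteness, here is a toy sketch of the kind of call I am timing (the variable names and the two-line computation are just placeholders, not my real code; the full MWE is below):

import torch

x_prev = torch.randn(5, requires_grad=True)   # 1D tensor from the "previous" iteration
x_curr = x_prev * 2                           # current iteration's state built from it
loss = x_curr.sum()                           # scalar from the current iteration
g, = torch.autograd.grad(loss, x_prev, retain_graph=True)
print(g)  # d(loss)/d(x_prev), here a tensor of 2s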
Please show me how I can avoid this scaling. An explanation of the scaling would also be appreciated.
Here is an MWE. I use time.time for the timings, but got similar results with %%timeit.
Note that because grad (as far as I can tell) only works on leaves of the graph, I got it working by storing the tensors in a list and calling requires_grad_() on them, which somehow makes them leaves, or at least stops grad from complaining. If you know of a cleaner way to do this that doesn’t require the list, I would be happy to know. Thanks!
import time

import torch

N = 10000


def step_forward(x):
    return x + 1


for num_steps in [100, 1000, 10000]:
    x = torch.ones([N])
    store_x = list()
    store_x.append(x)
    store_x[-1].requires_grad_()
    for i in range(num_steps):
        x = step_forward(x)
        store_x.append(x)
        store_x[-1].requires_grad_()
    st = time.time()
    # Differentiate one stored step w.r.t. the step before it.
    # grad_outputs is needed because store_x[i] is a 1D tensor, not a scalar.
    torch.autograd.grad(store_x[i], store_x[i - 1],
                        grad_outputs=torch.ones_like(store_x[i]),
                        retain_graph=True)
    print('T=' + str(num_steps) + ' took ' + str(time.time() - st))
Running on my machine gives the output:
T=100 took 0.00034737586975097656
T=1000 took 0.000988006591796875
T=10000 took 0.015600204467773438