**TL;DR:** does computing `torch.var(x)` on a 1D tensor of n values have a quadratic runtime under autograd, or does it have a linear runtime in n?

**Longer version:**

If I have n values in a 1D tensor and I compute the variance within an `nn.Module`, will autograd compute an n-by-n Jacobian in quadratic runtime?

For example, if we have values

`x = torch.tensor([1.,2.,4.,8.,16.,32.,64.], requires_grad=True)`

then `y = torch.var(x)` will set `grad_fn` in `y`.

The variance can be conceptualized as computing all pairwise squared distances (a quadratic number of operations) and scaling the sum:

`torch.sum( (x.view(-1,1) - x.view(1,-1))**2 / (2*len(x)*(len(x)-1)) )`

Or the variance can be conceptualized as caching the mean (a linear number of operations), computing all squared distances to that mean (also linear), and then scaling the result:

`torch.sum( (x - torch.mean(x))**2 / (len(x)-1) )`
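As a quick sanity check (a minimal sketch, assuming the default unbiased estimator that `torch.var` uses), both formulations agree numerically with the built-in:

```python
import torch

x = torch.tensor([1., 2., 4., 8., 16., 32., 64.])
n = len(x)

v_builtin = torch.var(x)  # unbiased by default (divides by n - 1)
v_pairwise = torch.sum((x.view(-1, 1) - x.view(1, -1))**2 / (2 * n * (n - 1)))
v_mean = torch.sum((x - torch.mean(x))**2 / (n - 1))

assert torch.allclose(v_builtin, v_pairwise)
assert torch.allclose(v_builtin, v_mean)
```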

In the first way, it’s clear that all pairs of values in `x` affect `torch.var`, and so an n-by-n matrix will be computed by autograd.

In the second (more standard) way, the mean is cached, with autograd recording a dependency of the mean on each value of `x`. Then the mean interacts with every value of `x` to compute the variance. So in this case it appears the gradients will be quadratic as well. Is that the case?

Also, is there a way I could have delved into `y.grad_fn` to determine this for myself, without asking and without empirically measuring whether the runtime grows quadratically? Thank you!
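For reference, one way to peek at the recorded graph (a sketch; the node class names are internal and may vary across PyTorch versions) is to walk `grad_fn.next_functions` recursively and print each backward node:

```python
import torch

x = torch.tensor([1., 2., 4., 8., 16., 32., 64.], requires_grad=True)
y = torch.var(x)

def walk(fn, depth=0):
    """Print the autograd graph rooted at a grad_fn node."""
    if fn is None:
        return
    print("  " * depth + type(fn).__name__)
    for child, _ in fn.next_functions:
        walk(child, depth + 1)

walk(y.grad_fn)
```

This shows which backward ops were recorded, though not their asymptotic cost; for that, the C++ backward implementation (or timing) is still the ground truth.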