Hi, a question about how the autograd engine works for intermediate values.
Suppose I have an equation (a*b + c*d) = e, pretty common in matrix multiplication. My understanding is that under the hood the calculation looks something like
_t1 = a * b
_t2 = c * d
e = _t1 + _t2
Then when .backward() is called, the gradient has to flow through e to _t1 and _t2 before reaching a,b,c,d.
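For concreteness, here is roughly how I picture checking this (a rough sketch; the shapes are just placeholders):

import torch

a, b, c, d = (torch.randn(3, 3, requires_grad=True) for _ in range(4))

e = a * b + c * d   # same as _t1 = a * b; _t2 = c * d; e = _t1 + _t2

# The graph nodes correspond to the temporaries, even though no Python
# variable refers to them:
print(e.grad_fn)                 # AddBackward0 (the "+" that produced e)
print(e.grad_fn.next_functions)  # two MulBackward0 entries (for _t1 and _t2)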
For large matrices, it seems that storing the gradient information for all these temporary values will be expensive memory-wise, but they also seem (?) necessary in order to backpropagate to the a, b, c, and d tensors.
I am guessing this has something to do with .retain_grad(), but I am not too sure. Any help clarifying my understanding is appreciated.
Note that * denotes element-wise multiplication; for matrix multiplication
one would use @. (However, my answer applies equally well to both cases.)
Yes, if you write the python statement e = a * b + c * d, python will do
essentially what you’ve written down, except that a * b and c * d won’t
have explicit references bound to them unless autograd decides (which it
doesn’t) that it needs to save those tensors for the backward pass.
Yes (depending upon which of the tensors a, b, c, and d carry requires_grad = True).
Roughly speaking, yes. The various temporary tensors that autograd
stores during the forward pass (because it needs them to perform the
backward pass) are almost always the dominant cost in memory when
training a model (forward pass, backward pass, optimization step).
In general, evaluation (forward pass, but with no need to save anything
for a backward pass) requires significantly less memory than training.
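For example, wrapping an evaluation forward pass in torch.no_grad() tells autograd not to record anything for a backward pass (a minimal sketch; the Linear layer and shapes here are arbitrary):

import torch

model = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

# Training-style forward pass: autograd records the graph and saves
# whatever intermediate tensors it needs for the backward pass.
loss = model(x).sum()
loss.backward()

# Evaluation-style forward pass: nothing is recorded or saved.
with torch.no_grad():
    y = model(x)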
Let’s look at what autograd will save in your example. (In general
autograd doesn’t save any and all intermediate tensors – it’s pretty
smart about only saving what it actually needs for the backward pass.)
Let me assume that all of a, b, c, and d carry requires_grad = True.
To compute the derivative of a * b with respect to a, autograd needs
to save b. Similarly, to compute the derivative of a * b with respect to b, autograd needs to save a. Likewise, autograd will save both c and d. Note that for your particular example, autograd does not need to
save _t1 or _t2 (nor e). (It needs to know that _t1 was obtained
from a and b using *, but that is a much smaller amount of information
that is independent of the size of _t1.)
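(You can watch this happen with pytorch’s saved-tensors hooks, which intercept every tensor autograd saves for the backward pass. A minimal sketch, assuming a reasonably recent version of pytorch where torch.autograd.graph.saved_tensors_hooks is available:)

import torch

a, b, c, d = (torch.randn(2, 2, requires_grad=True) for _ in range(4))

def pack(t):
    # Called once for each tensor that autograd saves during the forward pass.
    print("autograd saved a tensor of shape", tuple(t.shape))
    return t

def unpack(t):
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    e = a * b + c * d   # pack() fires for a, b, c, and d -- not for _t1, _t2, or e

e.sum().backward()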
(Also note that the autograd system may sometimes save auxiliary
intermediate tensors that never appear explicitly in the forward pass
if they support the most efficient way to perform the backward pass.)
So yes, autograd stores various intermediate tensors for use in its
backward pass and these tensors are a big part of the memory cost.
But the details of which particular tensors it needs to store depend on
the specific details of your forward-pass computation.
Take a look at how one implements a custom autograd function for
further detail on how this all works.
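Here is a minimal sketch of such a custom function for element-wise multiplication (not how the built-in * is actually implemented, but it shows the same save-the-inputs pattern):

import torch

class MyMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        # Only the inputs are saved -- not the product itself.
        ctx.save_for_backward(a, b)
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the incoming gradient.
        return grad_out * b, grad_out * a

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
MyMul.apply(a, b).sum().backward()
print(torch.allclose(a.grad, b), torch.allclose(b.grad, a))   # True True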
Firstly, I just wanted to clarify my understanding of how autograd works for tensors/matrices. My understanding is that each scalar element in a tensor can be thought of as independent (that is, the nodes it creates will naturally be different from the nodes the other values in the tensor create). Hence (a*b + c*d) = e was my example, where a, b are in the same row of the first matrix, c, d are in the same column of the second matrix, and e is the corresponding element of the result. Is my understanding accurate?
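For example, I would expect something like the following to hold (a rough check of what I mean; the shapes are arbitrary):

import torch

A = torch.randn(2, 3, requires_grad=True)
B = torch.randn(3, 2, requires_grad=True)
E = A @ B

# E[0, 0] only depends on row 0 of A (and column 0 of B), so only
# row 0 of A.grad should be non-zero.
E[0, 0].backward()
print(A.grad)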
To compute the derivative of a * b with respect to a, autograd needs to save b. Similarly, to compute the derivative of a * b with respect to b, autograd needs to save a. Likewise, autograd will save both c and d. Note that for your particular example, autograd does not need to save _t1 or _t2 (nor e). (It needs to know that _t1 was obtained from a and b using *, but that is a much smaller amount of information that is independent of the size of _t1.)
I don’t really understand this. Using the same expression
_t1 = a * b
_t2 = c * d
e = _t1 + _t2
The derivative of e with respect to a, which is needed for gradient descent for example, is a function of both the local derivative (b) and the derivative of e with respect to _t1. In that case, doesn’t autograd need to store the intermediate value _t1 as well?
In that case, doesn’t autograd need to store the intermediate value _t1 as well?
Let’s focus on the particular word “need”: in the example that you’ve shared, there’s no need to store either _t1 or _t2.
And why is that? Because the derivative of e wrt any (leaf) tensor a, b, c, or d does not require _t1 or _t2, as you can easily see:
de/da = b
and so on.
(Excuse my use of bad differentiation notation.)
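You can verify this numerically (a small sketch; calling .sum() just seeds the backward pass with a gradient of ones):

import torch

a, b, c, d = (torch.randn(4, 4, requires_grad=True) for _ in range(4))
e = a * b + c * d
e.sum().backward()

print(torch.allclose(a.grad, b))   # True: de/da = b
print(torch.allclose(c.grad, d))   # True: de/dc = d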
To reinforce your understanding, let’s look at another example where
e = 5 * _t1 * _t2 + 6 * _t2
Now here there is a need to store _t1 and _t2, because:
de/da = de/d_t1 * d_t1/da = (5 * _t2) * b
which requires _t2 to be stored in the memory.
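A quick numerical check of that formula (a minimal sketch):

import torch

a, b, c, d = (torch.randn(4, 4, requires_grad=True) for _ in range(4))
_t1 = a * b
_t2 = c * d
e = 5 * _t1 * _t2 + 6 * _t2
e.sum().backward()

print(torch.allclose(a.grad, 5 * _t2 * b))   # True: de/da = 5 * _t2 * b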
If it was clear up till here, then think about this: what if we do not store _t2 in memory?
Mathematically speaking, it should be just fine, because _t2 = c * d, and as long as c and d are stored in memory, you can always successfully “re-compute” _t2 and hence use it in the formula above.
What this means is that it’s a tradeoff between speed and memory: having _t2 stored in memory allows for faster backprop, as compared to re-computing it during the backward pass.
Please also see activation checkpointing.
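(A minimal sketch of what checkpointing looks like, assuming a recent pytorch with torch.utils.checkpoint; checkpointing a single multiply is pointless in practice and is only meant to show the mechanism:)

import torch
from torch.utils.checkpoint import checkpoint

a, b, c, d = (torch.randn(4, 4, requires_grad=True) for _ in range(4))

def block(x, y):
    # Whatever this block would normally save for backward is discarded;
    # the block is re-run during the backward pass to recompute it.
    return x * y

_t1 = a * b
_t2 = checkpoint(block, c, d, use_reentrant=False)
e = 5 * _t1 * _t2 + 6 * _t2
e.sum().backward()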