How does PyTorch handle in-place operations without losing information necessary for backpropagation?

I’ve seen several questions about the efficiency of in-place operations, but I’m actually more confused about the inner workings of PyTorch.

Let’s take a simple example like

a = torch.randn(10, requires_grad=True)
b = torch.randn(10, requires_grad=True)
c = torch.randn(10, requires_grad=True)

x1 = a * b
x2 = x1 * c 

In this case, things are easy. Backpropagation happens like this:

x2.grad <- 1
c.grad <- x2.grad * x1 = x1 = a * b
x1.grad <- x2.grad * c = c
b.grad <- x1.grad * a = c * a
a.grad <- x1.grad * b = c * b
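
For reference, this chain can be checked directly with autograd (a minimal sketch; it assumes the tensors are created with requires_grad=True as above, and seeds the backward pass with ones since x2 is a vector):

import torch

a = torch.randn(10, requires_grad=True)
b = torch.randn(10, requires_grad=True)
c = torch.randn(10, requires_grad=True)

x1 = a * b
x2 = x1 * c

# Seed the backward pass with ones (the "x2.grad <- 1" step above)
x2.backward(torch.ones_like(x2))

print(torch.allclose(a.grad, c * b))  # True
print(torch.allclose(b.grad, c * a))  # True
print(torch.allclose(c.grad, a * b))  # True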

Everything works correctly. However, in this scenario we are allocating two buffers: x1 and x2. Now, what happens when we do something like this:

x = a * b
x = x * c 

It seems to me that the overall expression is the same. However, if we try to compute the gradients the same way we did before, we will run into the following problem:

x.grad <- 1
c.grad <- x.grad * x = x = a * b * c

Uh oh, we already have a mistake. Since we performed the multiplication with c in place, we lost the buffer containing a * b, which was needed to calculate the gradient of c. Does PyTorch actually keep buffers for every intermediate calculation?

I can imagine some possible solutions to this:

  1. Every intermediate node actually has its own grad. I think this is likely not the answer, as it would require an enormous amount of memory.
  2. Every intermediate step actually uses temporary buffers? If so, how is allocation / deallocation of temporary buffers handled?

How is this kind of problem solved in modern frameworks?

Hi Lesser!

The short story is that you are not performing an in-place operation.

x = a * b creates a new pytorch tensor equal to a * b and creates a new (or maybe
reuses an existing) python reference x and sets it to refer to the a * b tensor. Then
x = x * c creates a new pytorch tensor equal to x * c and sets the python reference
x to refer to that new tensor. Although the python reference x now refers to what I
will call the x * c tensor, the a * b tensor still exists, even though it is no longer
referred to by the python reference x.

If this is part of a forward pass that autograd is tracking, autograd keeps its own
reference to the a * b tensor (if needed) for use in the backward pass.
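
Here is a minimal sketch of what that looks like (the id() check is only for illustration, and it assumes the tensors were created with requires_grad = True):

import torch

a = torch.randn(10, requires_grad=True)
b = torch.randn(10, requires_grad=True)
c = torch.randn(10, requires_grad=True)

x = a * b
ab_id = id(x)                          # remember which tensor object holds a * b

x = x * c                              # rebinds the name x to a brand-new tensor
print(id(x) != ab_id)                  # True: x now refers to a different tensor

# autograd's graph still references the a * b tensor, so the backward pass
# can use it to compute c's gradient
x.backward(torch.ones_like(x))
print(torch.allclose(c.grad, a * b))   # True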

(If you had wanted to modify the a * b tensor in place, you could use, for example,
x.mul_(c) at the point where x still refers to the a * b tensor.)
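
For completeness, here is roughly what that in-place version looks like. In recent versions of pytorch I would expect the backward pass to fail with a RuntimeError about an in-place modification, since the a * b values needed for c’s gradient have been overwritten (the try / except is only there to keep the sketch runnable):

import torch

a = torch.randn(10, requires_grad=True)
b = torch.randn(10, requires_grad=True)
c = torch.randn(10, requires_grad=True)

x = a * b
x.mul_(c)   # a genuine in-place op: overwrites the a * b values

try:
    x.backward(torch.ones_like(x))
except RuntimeError as err:
    # autograd saved the pre-multiplication x to compute c's gradient; its
    # version counter notices that the buffer was modified in place
    print(err)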

Best.

K. Frank


Thank you very much for your explanation, Frank!

So it seems that doing

x = a * b
x = x * c 

is not at all different from doing

x1 = a * b
x2 = x1 * c

correct?

I just have another question, if you wouldn’t mind. In the scenario described above, both x1 and x2 are non-leaf tensors, correct?

I was under the impression that non-leaf tensors do not have gradient buffers associated with them. So how does backpropagation happen? For example, I previously thought it worked like this, but:

x2.grad <- 1  // This can't happen, as x2.grad does not exist.
c.grad <- x2.grad * x1 = x1 = a * b  // x2.grad does not exist here either
x1.grad <- x2.grad * c = c  // and so on
b.grad <- x1.grad * a = c * a
a.grad <- x1.grad * b = c * b

Am I misunderstanding the idea that non-leaf tensors do not have gradients? Do they use temporary buffers for backpropagation?

Thank you very much!

Hi Lesser!

Yes, they are basically the same. Note, in the first case, the a * b tensor no longer
has a reference referring to it, so (unless something else refers to it, such as a
reference used internally by autograd) it will be freed. In the second case, in contrast,
x1 continues to refer to the a * b tensor, so it won’t be freed (until x1 goes out of
scope, or some such).

Correct.

When you backpropagate (with the default processing), autograd computes, for
example, the gradient of some loss with respect to x2 and then passes that gradient
up the computation graph (the backpropagation chain) in order to compute the
gradient of the loss with respect to x1. When it’s done using the gradient with
respect to x2, that gradient gets freed.
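
If you want to see one of those intermediate gradients as it flows by, without asking autograd to store it anywhere, you can attach a hook (a small sketch, assuming the same a, b, c as before with requires_grad = True):

import torch

a = torch.randn(10, requires_grad=True)
b = torch.randn(10, requires_grad=True)
c = torch.randn(10, requires_grad=True)

x1 = a * b
x2 = x1 * c

# the hook fires when the gradient with respect to x1 is computed, even
# though that gradient is not kept in x1.grad afterwards
x1.register_hook(lambda grad: print("grad w.r.t. x1:", grad))

x2.backward(torch.ones_like(x2))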

Non-leaf tensors can, in fact, have their gradients stored in a .grad property during
backpropagation. For example, calling x2.retain_grad() before backpropagation
will cause autograd to store the gradient with respect to x2 in x2.grad.

(This is as if .retain_grad() is called by default on leaf tensors and not on non-leaf
tensors.)
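
For example (again a minimal sketch, with requires_grad = True on the leaves):

import torch

a = torch.randn(10, requires_grad=True)
b = torch.randn(10, requires_grad=True)
c = torch.randn(10, requires_grad=True)

x1 = a * b
x2 = x1 * c

x2.retain_grad()                  # ask autograd to keep x2's gradient
x2.backward(torch.ones_like(x2))

print(x2.grad)                    # a tensor of ones, kept because of retain_grad()
print(a.grad is not None)         # True: leaf tensors keep their gradients by default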

The forward pass / backward pass is very memory hungry for big models, so you really
do want those intermediate gradients to be freed as soon as they can be – hence this
default behavior of only keeping gradients for leaf tensors.

Best.

K. Frank