Autograd and Temporary Variables

I’m fairly new to PyTorch, and I’m running into some questions about how autograd will interact with parts of my code; I haven’t been able to find answers to them.

  def forward(self, theta: Tensor) -> Tensor:
    # theta is a (B,F,L) tensor, where B is a batch dimension
    th0 = theta[:,:,0].view(self.b,5,1,1,1)
    th1 = theta[:,:,1].view(self.b,5,1,1,1)
    th2 = theta[:,:,2].view(self.b,5,1,1,1)
    th3 = theta[:,:,3].view(self.b,5,1,1,1)

    # frames is a 5-tensor of shape (B,F,L,4,4)
    # in the start state, all of the (4,4) blocks are identity
    frames = self.start_state.clone()

    # swivel is a 5-tensor of shape (B,F,L,4,4)
    S = self.swivel.clone()

    S[...,:3,:3] = self.I + torch.sin(th0)*S[...,:3,:3] + \
      (1-torch.cos(th0))*(S[...,:3,:3]@S[...,:3,:3])

    S.requires_grad = True
    S.retain_grad()

    frames = S@frames

I had a few questions about this. After some similar further transformations, the frames tensor (or a particular slice of it) will be the output.

  1. S here is a temporary derived from a member variable of my class. From some reading I got the impression that S.retain_grad() would allow the gradients of S to be maintained for the backward pass, but is that correct? If I return frames at this step and call a backward pass, it ‘works’, but I’m not sure whether it’s actually computing gradients through S as I would like it to.

  2. If I put the two lines enabling grad for S before the operation on the 3x3 slice of its last two dimensions, PyTorch raises an error saying an in-place operation was performed on a leaf of the compute graph. That makes sense, and the current placement raises no errors, but I’m left wondering whether putting those lines after that op actually fixes the issue, or whether it will still interact badly with the autograd engine.

  3. I’m not very sure how editing or multiplying slices of tensors affects gradient computation, if at all. For example, the next transformations applied to frames are:

# curl is a (B,F,L-1,4,4) tensor
C1 = self.curl.clone()
....

C2 = self.curl[:,:,:2,...].clone()
....

frames[:,:,1:,:,:] = C1@frames[:,:,1:,:,:]
frames[:,:,2:,:,:] = C2@frames[:,:,2:,:,:]

# I've cut out a few lines computing blocks of C1 and C2 and enabling grad here;
# they're almost identical to what was done for S.

My intent is to apply the transformations C1 and C2 only to those sections of frames, in a differentiable manner. Will this work, or is there a nuance I’m missing?

Any help would be greatly appreciated! I’ve not worked with PyTorch internals much at all (although I’m slowly working through the tutorials and docs), so any detail might help.

(P.S. - if you have stylistic comments, I’d be glad to hear them as well. I have tried to do things cleanly where possible, but I’m not sure I’ve succeeded.)

Hi Vaker!

It’s not clear what you are trying to do or what is going on here.

Some specific questions:

Is autograd supposed to be tracking gradients for theta? Specifically, does the theta
passed into forward() carry requires_grad = True, as would be the case if it were the
output of some previous layer with trainable parameters?

Does self.start_state carry requires_grad = True?

Does swivel carry requires_grad = True? (Presumably not.)

Regarding your first question: If I understand what you are asking, not exactly. S.requires_grad = True would tell
autograd to track gradients for S (while S.retain_grad() – which you would typically
not use – does something rather different).

Autograd will not compute gradients “through” S. It will only compute gradients back up
to (and including) S, but not beyond. This is because, as your code stands, S is a
so-called “leaf variable” – the point at which autograd begins tracking gradients during
the forward pass.

I deduce (but without the relevant code cannot verify) that prior to the statement
S.requires_grad = True, S.requires_grad has the value False. Therefore, S is
not yet a leaf variable of the computation graph, so modifying it inplace by executing
the line S[...,:3,:3] = ... is allowed. But when you move S.requires_grad = True
to before S[...,:3,:3] = ..., S now is already a leaf variable prior to your attempted
inplace modification, so you get the error.

As for your second question: It won’t fix the issue that I imagine you are thinking about (but I don’t really know what
you are trying to do). Concretely, gradients from frames = S @ frames will not flow
back up through S to self.swivel because (as written) S is the leaf variable, so
that’s where the backward pass will stop.

(As an aside, if self.start_state carries requires_grad = True, gradients will
flow back up through the frames branch of the computation graph. The fact that S
is a leaf variable only blocks backpropagation back up to variables from which it was
computed – not other parts of the computation graph, such as the frames branch.)

Best.

K. Frank


Hi Frank, thanks for your reply!

I apologize for the lack of clarity in my question; I tried to post a minimal code example to illustrate the types of operations I was confused about, but in retrospect I can see how I’ve left rather too much for the reader to infer.

For some context, this function is intended to implement batched forward kinematics for a particular class of parallel robotic manipulators. The theta tensor can be interpreted as a batch of joint (angle) configurations. In my application, the joint angles are predicted by a neural network, which I’ll call net. Everything stored in self can be viewed as an expression of the structure of the manipulator, so it is all constant. I’ll give more detail below.

(Now answering your specific questions)

  1. You’re correct about theta: it’s an output from net, and the output of the forward function here is intended for use in a loss for training net, so it has grad enabled.

  2. self.start_state is a collection of constant transforms defining a neutral position of the robot; it has requires_grad = False.

  3. self.swivel is just an expansion of a single constant 4x4 matrix: if that matrix is A, then
    self.swivel = A.view(1,1,1,4,4).expand(B,F,L,-1,-1), where B, F, L are the same dimensions as for theta in my original post.
    In my code it’s used to generate the (theta-dependent) transform S. So swivel has requires_grad = False because it’s a constant, and I’m assuming that the multiplication by elements of theta makes S differentiable, but I’m not sure whether that’s true.
    Working in the simple case B = F = L = 1, S = I + sin(theta)*A + (1 - cos(theta))*(A@A), and I’d ideally want dS/dtheta = cos(theta)*A + sin(theta)*(A@A) (a quick numerical check of this is sketched just below).
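
As a quick sanity check of that last point, here is a minimal, unbatched sketch I have in mind (the 3x3 A below is a made-up skew-symmetric stand-in for the real constant blocks, and the scalar theta stands in for a single predicted joint angle):

import torch

# A is a constant stand-in with requires_grad = False, like self.swivel;
# theta plays the role of a single network-predicted joint angle.
A = torch.tensor([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 0.]])
theta = torch.tensor(0.7, requires_grad=True)

S = torch.eye(3) + torch.sin(theta)*A + (1 - torch.cos(theta))*(A@A)

# Any scalar built from S works as a stand-in loss for calling backward().
S.sum().backward()

# d(sum(S))/dtheta should equal the sum of the entries of cos(theta)*A + sin(theta)*(A@A).
expected = torch.cos(theta)*A + torch.sin(theta)*(A@A)
print(torch.allclose(theta.grad, expected.detach().sum()))  # expect True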

Replying to the rest of your message:

If I understand what you are asking, not exactly. S.requires_grad = True would tell
autograd to track gradients for S (while S.retain_grad() – which you would typically
not use – does something rather different).

Ah, alright: I read up on .retain_grad() again, and after checking S.is_leaf I can see that it’s not necessary. I had thought that temporary variables needed to be explicitly added as leaves of the computation graph, but that was probably not giving enough credit to the PyTorch folks.

Autograd will not compute gradients “through” S. It will only compute gradients back up
to (and including) S, but not beyond. This is because, as your code stands, S is a
so-called “leaf variable” – the point at which autograd begins tracking gradients during
the forward pass.

Right, I think that makes sense. As I expanded above, a slice of the output of this forward (essentially just frames matrix-multiplied by a sequence of transformations, the first of which is S) will be put through a simple, differentiable loss function. At that point, when I call loss.backward(), would it be correct to say that autograd will compute gradients through S? Or am I still misunderstanding?

Yes, your deductions about the in-place error are absolutely right, and that explanation makes a lot of sense.

It won’t fix the issue that I imagine you are thinking about (but I don’t really know what
you are trying to do). Concretely, gradients from frames = S @ frames will not flow
back up through S to self.swivel because (as written) S is the leaf variable, so
that’s where the backward pass will stop.

With the new context that self.swivel is constant, am I right to say that this part does not matter? I only need gradients to flow back through theta for the training of net; will that actually work? If my statement that S is no longer a leaf after the loss computation is correct, then I think it will. (I apologize again here; I cannot find a way to state this more precisely.)

Huh, that’s not quite how I thought it worked. I had thought that since self.start_state is constant, self.start_state.requires_grad = False would just treat it as a constant in the gradient calculation, rather than not having gradients flow through it at all.

I think, all in all, I need to do loads more reading into how autograd actually works, alongside some practice. My past projects have been simple enough that this sort of thing hasn’t been an issue, but I can’t see myself doing anything particularly sophisticated with so much missing knowledge.

Thanks again for your response- it was really very helpful, and it’s given me a good number of avenues for learning. If you have any further comments, I’d be glad to hear them!

Best regards,
Vaker

Hi Vaker!

Let me make a couple of comments:

First, you should play around with some simple computations (e.g., w = x + y * z)
and track how things like requires_grad = True evolve through the intermediate
results, depending on which of the initial variables start out with requires_grad = True.
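
For instance, a tiny session along those lines (only y starts with requires_grad = True) might look like this:

import torch

x = torch.ones(3)                        # requires_grad = False
y = torch.ones(3, requires_grad=True)
z = torch.ones(3)                        # requires_grad = False

u = y * z                                # intermediate result
w = x + u

print(u.requires_grad, u.is_leaf)        # True False
print(w.requires_grad, w.is_leaf)        # True False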

There is a package – that I have never used – called torchviz that will supposedly help
you visualize computation graphs. You might play around with this to get a feel for what
computation graphs look like in practice.
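
I have not run it myself, so treat this purely as a sketch of what its README suggests the usage looks like (torchviz also needs the graphviz system package installed):

import torch
from torchviz import make_dot   # pip install torchviz

x = torch.ones(3, requires_grad=True)
loss = (x * x).sum()

# make_dot reportedly returns a graphviz Digraph of loss's autograd graph;
# render() then writes it out as an image file.
make_dot(loss).render("autograd_graph", format="png")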

Let me add some clarification to my mention of .is_leaf: I’m not sure how .is_leaf
is used in pytorch – to me, it means “has the potential to become the leaf of an actual
computation graph.” I say this because .is_leaf seems to be True for tensors that
have requires_grad = False, even if they haven’t been used in any computation.

Consider:

>>> import torch
>>> torch.__version__
'2.7.0+cu128'
>>> t1 = torch.ones (3)
>>> s1 = t1 * t1
>>> t1.requires_grad
False
>>> t1.is_leaf
True
>>> s1.requires_grad
False
>>> s1.is_leaf
True
>>> t2 = torch.ones (3, requires_grad = True)
>>> s2 = t2 * t2
>>> t2.requires_grad
True
>>> t2.is_leaf
True
>>> s2.requires_grad
True
>>> s2.is_leaf
False

That is, a tensor that has requires_grad = False isn’t (in my mind) a full-fledged
member – through which gradients flow – of any actual computation graph (leaf or not),
but it has .is_leaf = True. (Note, in c = 1.2, y = c * x, we could call c a leaf
of the computation graph because it is needed to compute the gradient of y with respect
to x. But c, itself, doesn’t have any .grad and no gradients flow through it.) Just
something to bear in mind if you’re using .is_leaf to track computation graphs.

Going back to the code you posted initially:

As I understand you, the theta passed in has requires_grad = True, while start_state
and swivel do not. So immediately after the two clone() calls, neither frames nor S
has requires_grad = True.

However, th0 does have requires_grad = True, so after the line
S[...,:3,:3] = self.I + torch.sin(th0)*S[...,:3,:3] + ..., S will, as well.

As I read your code, S.requires_grad = True does nothing, since .requires_grad is
already True (and S.retain_grad() probably doesn’t do anything you want).

Finally, frames = S@frames creates a new tensor (that happens to use the same
Python reference name, “frames,” that used to refer to the “old” frames). This new
frames will have requires_grad = True because S did (even though the old
frames didn’t).
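
In miniature (toy tensors, not your real shapes):

import torch

theta = torch.tensor(0.5, requires_grad=True)
S = torch.cos(theta) * torch.eye(4)      # depends on theta, so requires_grad = True

frames = torch.eye(4)                    # "old" frames, requires_grad = False
frames = S @ frames                      # brand-new tensor bound to the same name

print(frames.requires_grad, frames.is_leaf)   # True False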

Yes, your assumption that the multiplication by elements of theta makes S differentiable
is true: S should be “differentiable” in that it carries requires_grad = True.

Let me correct what I said in my earlier post. Based on what you’ve said, after
S[...,:3,:3] = self.I + torch.sin(th0)*S[...,:3,:3] + ..., S will
have requires_grad = True and depend on th0 and thus on theta. So if
you compute your loss value using the (new) frames, calling .backward() on
that loss value will backpropagate gradients back up through S to theta (as I
understand is what you want).
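
Here is a stripped-down sketch of that whole chain (a made-up example, with toy shapes
and values standing in for your actual swivel and start_state):

import torch

theta = torch.tensor(0.3, requires_grad=True)   # plays the role of net's output
swivel = torch.zeros(4, 4)                      # constant, requires_grad = False
start_state = torch.eye(4)                      # constant, requires_grad = False

S = swivel.clone()                              # still requires_grad = False here
S[:3, :3] = torch.cos(theta) * torch.eye(3)     # in-place assignment involving theta
# S now has requires_grad = True and is an interior node of the graph, not a leaf.

frames = S @ start_state
loss = frames.sum()                             # stand-in for your real loss
loss.backward()

print(theta.grad)                               # not None: gradients reached theta through S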

Given that S already has requires_grad = True, S.requires_grad = True
is a no-op, and you can (and stylistically should) leave it out.

In my usage, S was never a leaf variable (in the sense of being a leaf of some
actual computation graph). Immediately after S = self.swivel.clone(), S does
have .is_leaf = True, but because it also has requires_grad = False, I don’t
count it as a “real” leaf variable.

Then after S[...,:3,:3] = self.I + torch.sin(th0)*S[...,:3,:3] + ..., S
becomes a non-leaf interior node of the computation graph (through which gradients
will flow).

Consider:

>>> t = torch.ones (3, requires_grad = True)
>>> s = torch.ones (3)
>>> s.requires_grad
False
>>> s[1] = t[2]
>>> s.requires_grad
True
>>> s.is_leaf
False

Note, computing a loss value from (the new) frames doesn’t change any of this. S
is still an interior node of the graph – computing the loss just adds a new “root” to
the “head” of the graph.

Best.

K. Frank


Hi Frank! Sorry for the delayed reply; I had an unusually packed weekend.

i) I’ll take a look at torchviz; if it works as advertised, it seems like it could be a good tool for understanding how torch builds computation graphs in situations where I’m unsure what will happen.
ii) The point on .is_leaf is surprising; I could certainly have seen myself using it incorrectly and getting confused if you hadn’t pointed it out.

To the rest of your reply: your comments on the position of S in the graph structure make a lot of sense, and taking those alongside the examples you provided and my own reading, I don’t think I have any follow-up questions. (I’ve also verified on my end that everything works as you stated, and updated my code to remove the no-ops.)

Thank you again for your detailed responses and explanations to my questions! They cleared up a good number of autograd pain points for me which might otherwise have taken a lot more work to understand.

(This response feels rather short, in contrast; I want to emphasize that this is because I really have no further questions, and I feel that this exchange has given me a solid starting point to learn more from on my own.)