All Tensors that have requires_grad which is False will be leaf Tensors by convention.
For Tensors that have requires_grad which is True , they will be leaf Tensors if they were created by the user. This means that they are not the result of an operation and so grad_fn is None.
Only leaf Tensors will have their grad populated during a call to backward(). To get grad populated for non-leaf Tensors, you can use retain_grad().
Hi, I just started learning about autograd and I am confused by the PyTorch docs quoted above. They say that a tensor with requires_grad=False is a leaf tensor, and that only leaf tensors have their grad populated during a call to backward(). But why would a tensor's grad be populated when it does not require a gradient? What is the purpose of the is_leaf attribute?
Hi,
Thanks for the great link to the blog. I have read it and gained a lot of insight into how autograd works, but the article only gives the definition of a leaf node and does not explain why it is needed; perhaps the author thinks it should be intuitive. It also does not answer some questions I had, such as:
>>> b = torch.rand(10, requires_grad=True).cuda()
>>> b.is_leaf
False
# b was created by the operation that cast a cpu Tensor into a cuda Tensor
>>> f = torch.rand(10, requires_grad=True, device="cuda")
>>> f.is_leaf
True
# f requires grad, has no operation creating it
Does the above mean that if we call cuda() on a tensor, we then have to call retain_grad() on the result so that its grad attribute will be populated?
We need graph leaves to be able to compute gradients of the final tensor w.r.t. them. Leaf nodes are not functions; put simply, they have not been obtained from mathematical operations. For instance, in an nn.Linear(in, out) module, weight and bias are leaf nodes, so when you call .backward() on a loss that uses this linear layer, the gradient of the loss is calculated w.r.t. the weight and bias. In other words, all parameters of layers are leaf nodes.
Because .cuda() is an operation, then yes, you have to call retain_grad() on its result (retain_grad is a method, not a keyword argument).
Here is an example that might help:
import torch
import torch.nn as nn
conv = nn.Conv2d(1, 3, 2)  # its params require grad; we created it, so they are leaves
print(conv.weight.requires_grad)  # True; conv.weight.grad is still None because backward() has not run yet
print(conv.weight.is_leaf)  # True
z = conv(torch.randn(1, 1, 10, 10)).sum()  # output of the sum operation, so not a leaf, but it requires grad
print(z.is_leaf) # false
z.backward()
print(conv.weight.grad_fn) # none
print(z.grad_fn) # <SumBackward0 object>
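To make the retain_grad() point above concrete, here is a small sketch. It is only an illustration under an assumption: I use .double() as a CPU-only stand-in for .cuda(), since both are operations recorded by autograd and both return a non-leaf when the input requires grad. The names a_cast and expected are mine.

```python
import torch

a = torch.rand(3, 2, requires_grad=True)   # created by the user: a leaf
a_cast = a.double()                         # result of an operation: not a leaf (stand-in for .cuda())
a_cast.retain_grad()                        # ask autograd to keep .grad on this non-leaf

b = torch.rand(2, 3, dtype=torch.float64)
loss = 10 - (a_cast @ b).sum()
loss.backward()

print(a.is_leaf, a_cast.is_leaf)            # leaf status of the original and the cast tensor
print(a_cast.grad)                          # populated only because of retain_grad()
print(a.grad)                               # the original leaf also receives its gradient

# Every row of the gradient is minus the row sums of b, the same pattern as in scenario 1.
expected = -b.sum(dim=1).expand(3, 2)
```

Without the retain_grad() call, a_cast.grad would stay None while a.grad would still be filled in.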
That said, after reading a few posts, some questions came up for me too. I think we need to look at some presentations about it (let's google it!)
Thanks for the explanation! It is pretty clear, but I still have some questions because the results look kind of weird to me:
# scenario 1
>>> a = torch.rand(3,2,requires_grad=True)
>>> b = torch.rand(2,3)
>>> loss = 10 - (a @ b).sum()
>>> loss.backward()
>>> a.grad
tensor([[-0.9418, -1.4957],
[-0.9418, -1.4957],
[-0.9418, -1.4957]]) # expected, has gradient
>>> b.grad # expected, no gradient
>>>
...
# scenario 2
>>> a = torch.rand(3,2,requires_grad=True).cuda()
>>> b = torch.rand(2,3).cuda()
>>> loss = 10 - (a @ b).sum()
>>> loss.backward()
>>> a.grad # expected since operation `cuda` makes it a non-leaf, but why?
>>> b.grad # expected, no gradient
>>>
...
# scenario 3
>>> ll = nn.Linear(3,3).cuda()
>>> inp = torch.rand(3).cuda()
>>> loss = (10 - ll(inp)).sum()
>>> loss.backward()
>>> ll.weight.grad
tensor([[-0.8497, -0.1128, -0.8081],
[-0.8497, -0.1128, -0.8081],
[-0.8497, -0.1128, -0.8081]], device='cuda:0') # why it has gradient?
As you can see in scenario 2, after moving the tensor to CUDA I was not able to get the gradient without using a.retain_grad() or creating the tensor with device='cuda'. In scenario 3, after moving the layer to CUDA, the gradient of the weight is still there. May I know why it is designed this way? I understand that the result of an operation like multiplication does not get its gradient stored, since gradients of intermediate values are usually not useful. But why does an operation like cuda() make a tensor a non-leaf, while the weight of a layer stays a leaf?
About scenario 1, we are OK, right?
About case 2, I don't know why it does not accumulate the gradient even though it computes it. The docs literally say: "The fact that gradients need to be computed for a Tensor do not mean that the grad attribute will be populated, see is_leaf for more details." (source)
About case 3: logically, if we want to update the weight and bias, we need the corresponding gradients. About the code, I am not sure, because we can make scenario 2 behave like scenario 3 by creating the tensors directly on the GPU instead of copying them from the CPU; that way, a in case 2 is a leaf and has a grad. Also, weight and bias in nn.Linear are parameters, a special case of tensors, so maybe that is what enables this case.
a = torch.rand(3,2,requires_grad=True, device='cuda:0')
b = torch.rand(2,3).cuda()
loss = 10 - (a @ b).sum()
loss.backward()
print(a.is_leaf)
a.grad
# output
True
tensor([[-1.4735, -1.9717],
[-1.4735, -1.9717],
[-1.4735, -1.9717]], device='cuda:0')
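For what it's worth, here is a sketch of the difference between scenarios 2 and 3. Two assumptions: I again use .double() instead of .cuda() so it runs without a GPU, and my understanding is that the module-level conversion methods (Module.cuda(), Module.double(), Module.to()) convert the parameters themselves rather than recording an autograd operation, which is why the parameters stay leaves.

```python
import torch
import torch.nn as nn

# Tensor-level cast after requires_grad=True: the result is a non-leaf.
t = torch.rand(3, 2, requires_grad=True).double()
print(t.is_leaf)

# Cast first, then turn on requires_grad: the result is a leaf.
t2 = torch.rand(3, 2).double().requires_grad_()
print(t2.is_leaf)

# Module-level cast: the parameters are converted but remain leaves.
layer = nn.Linear(3, 3).double()
print(layer.weight.is_leaf, layer.weight.dtype)

inp = torch.rand(3, dtype=torch.float64)
loss = (10 - layer(inp)).sum()
loss.backward()
print(layer.weight.grad)   # every row equals -inp, matching the repeated rows in scenario 3
```

So for a plain tensor the order matters: move or cast first, then call requires_grad_(), or pass device= (or dtype=) at creation time.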
I really need to study a few things, sorry for my lack of knowledge!
The docs don’t really indicate what is going on with is_leaf. In particular, I think the sentence that says “Only leaf Tensors will have their grad populated…” is misleading. From what I can guess, leaves don’t really have to do with populating grad; requires_grad is what governs that.
I think the is_leaf property is really about the reverse graph. When x.backward() is called, all of the action happens on the "reverse-mode differentiation" graph at x. x is the root, and the graph runs up along (against the arrows in) the forward graph from x. While only tensors with requires_grad = True appear in the reverse graph, the grad_fn of every non-leaf tensor visited is used to produce the maps corresponding to the reverse graph's arrows. When the process hits a non-leaf, it knows it can keep mapping along to more nodes. On the other hand, when the process hits a leaf, it knows to stop; leaves have no grad_fn.
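The reverse graph described above can be inspected directly. This is a rough sketch, not how autograd itself is implemented: reachable_leaves is a hypothetical helper of mine that follows grad_fn.next_functions and relies on the fact that a leaf requiring grad shows up as an AccumulateGrad node exposing the tensor as .variable, while an input that does not require grad shows up as None. It does no de-duplication, so a leaf used twice would be listed twice.

```python
import torch

def reachable_leaves(root):
    """Collect leaf tensors reached by walking the backward graph from root (hypothetical helper)."""
    leaves = []
    stack = [root.grad_fn]
    while stack:
        node = stack.pop()
        if node is None:                 # input that does not require grad: nothing to follow
            continue
        if hasattr(node, "variable"):    # AccumulateGrad node: we have reached a leaf
            leaves.append(node.variable)
        for next_node, _ in node.next_functions:
            stack.append(next_node)
    return leaves

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0)
w = torch.tensor(3.0)
o = (x * y) * w

print(type(o.grad_fn).__name__)          # backward node of the last multiplication
print(o.grad_fn.next_functions)          # one entry per input of that multiplication
found = reachable_leaves(o)
print(len(found), found[0].requires_grad)
```

Only x is found: y and w are leaves too, but since they do not require grad they never enter the reverse graph.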
If this is right, it makes it more clear why weights are “leaves with requires_grad = True”, and inputs are “leaves with requires_grad = False.” You could even take this as a definition of “weights” and “inputs”.
Based on PyTorch’s design philosophy, is_leaf is not explained because it’s not expected to be used by the user unless you have a specific problem that requires knowing if a variable (when using autograd) was created by the user or not.
“If there’s a single input to an operation that requires gradient, its output will also require gradient. Conversely, only if all inputs don’t require gradient, the output also won’t require it. Backward computation is never performed in the subgraphs, where all Tensors didn’t require gradients.” – Autograd mechanics — PyTorch 1.8.1 documentation
a = torch.randn(2, 4)
b = torch.randn(4, 2)
c = a.mm(b)
tuple(map(lambda t: t.is_leaf, (a, b, c)))
...
# (True, True, True)
In the next example, c is not a leaf (is_leaf is False) because it was not created directly by the user (it is the result of an operation on a and b) and its requires_grad is True.
a = torch.randn(2, 4).requires_grad_()
b = torch.randn(4, 2)
c = a.mm(b)
tuple(map(lambda t: t.is_leaf, (a, b, c)))
...
# (True, True, False)
I think all your questions have been answered by others except for - " but why a tensor’s grad is populated when it does not require gradient?"
I assume you mean: why would a tensor's grad be populated when its requires_grad is False?
The answer is that a tensor's grad is not populated when its requires_grad is False.
x = torch.tensor(1.0, requires_grad = True)
y = torch.tensor(2.0)
z = x * y
w = torch.tensor(3.0).requires_grad_(False)
o = z*w
o.backward()
for i, name in zip([x, y, z, w], "xyzw"):
    print(f"{name}\ndata: {i.data}\nrequires_grad: {i.requires_grad}\ngrad: {i.grad}\ngrad_fn: {i.grad_fn}\nis_leaf: {i.is_leaf}\n")
In the above example you can see that w has is_leaf True but its requires_grad is False; as a result, its grad is None after calling .backward().
This means that requires_grad takes precedence over is_leaf: gradients are not calculated w.r.t. w even though is_leaf is True, because requires_grad is False.
This is how I intuitively understand leaf tensors: Leaf tensors are tensors that stop the flow of gradients on the backward pass. They can store gradients themselves. But they will not allow gradients to flow back through them to other tensors.
They are usually parameters and input tensors of the network. But they can also be tensors that are made from operations as long as they do not allow gradients to flow through them.
For example, if you have tensors A and B made with torch.ones(), they will be leaf tensors because they cannot let gradients flow through them, even if you set requires_grad to True. That's because there are no predecessor tensors for the gradients to flow into. There are only A and B.
If you do C=A*B, then C will be a leaf tensor if it does not allow gradients to flow to A and B.
C will not be a leaf tensor if it allows gradients to flow to A or B.
How can C allow gradients to flow to A or B? Just set requires_grad=True on either A or B before computing C.
If you set A.requires_grad = True, then C will have to pass the gradients on to A, making C not a leaf tensor anymore.
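Here is the A, B, C example above written out as a quick check. The last part with detach() is my own addition, to illustrate a tensor that comes from an operation but is still a leaf because it is cut off from the graph.

```python
import torch

A = torch.ones(2)
B = torch.ones(2)

# Neither input requires grad, so C has nothing to pass gradients to: it is a leaf.
C = A * B
print(C.is_leaf, C.requires_grad, C.grad_fn)

# Once A requires grad, a freshly computed C must pass gradients back to A: not a leaf.
A.requires_grad_(True)
C = A * B
print(C.is_leaf, C.requires_grad)

# detach() cuts the result off from the graph, so it is a leaf again.
D = (A * B).detach()
print(D.is_leaf, D.requires_grad)
```

Note that C has to be recomputed after changing A.requires_grad; the C computed earlier keeps its old status.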
What’s the implication or use case of this knowledge?
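When you see a leaf tensor, you know that gradients will not flow through it. The backward pass stops right at this tensor, so no further backward computation is done beyond it, which should keep the compute (for example GPU usage) of the backward pass lower.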