All Tensors that have requires_grad which is False will be leaf Tensors by convention.

For Tensors that have requires_grad which is True, they will be leaf Tensors if they were created by the user. This means that they are not the result of an operation and so grad_fn is None.

Only leaf Tensors will have their grad populated during a call to backward(). To get grad populated for non-leaf Tensors, you can use retain_grad().

Hi, I just started learning about autograd and I am confused by the PyTorch docs above. They say that a tensor with requires_grad=False is a leaf tensor, and that only leaf tensors have their grad populated during a call to backward(). But why would a tensor's grad be populated when it does not require gradient? What is the purpose of the is_leaf attribute?

Hi,
Thanks for the great link to the blog. I have read it and gained a lot of insight into how autograd works, but the article only gives the definition of a leaf node and does not explain why it is needed; perhaps the author thinks it should be intuitive. It also does not answer some questions I had, such as:

>>> b = torch.rand(10, requires_grad=True).cuda()
>>> b.is_leaf
False
# b was created by the operation that cast a cpu Tensor into a cuda Tensor
>>> f = torch.rand(10, requires_grad=True, device="cuda")
>>> f.is_leaf
True
# f requires grad, has no operation creating it

Does the above mean that if we call cuda() on a tensor, we have to call retain_grad() on the result so that its grad attribute will be populated?

We need graph leaves to be able to compute gradients of the final tensor w.r.t. them. Leaf nodes are not the output of a function; simply put, they have not been obtained from mathematical operations. For instance, in an nn.Linear(in_features, out_features) module, weight and bias are leaf nodes, so when you call .backward() on a loss that uses this linear layer, the gradient of the loss is calculated w.r.t. the weight and bias. In other words, all parameters of layers are leaf nodes.

Because .cuda() is an operation, yes, you have to call retain_grad() on the result (it is a method, not a retain_grad=True argument).
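To make that concrete, here is a minimal sketch that runs on CPU. It uses .double() as a stand-in for .cuda(), on the assumption that any differentiable conversion applied to a tensor that requires grad behaves the same way with respect to is_leaf. The variable names are only illustrative.

```python
import warnings
import torch

b = torch.rand(2, 3).double()

# Without retain_grad(): the converted tensor is a non-leaf, so .grad stays None.
n = torch.rand(3, 2, requires_grad=True).double()
(10 - (n @ b).sum()).backward()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # reading .grad of a non-leaf emits a UserWarning
    missing = n.grad

# With retain_grad(): autograd keeps the gradient of the non-leaf tensor.
a = torch.rand(3, 2, requires_grad=True).double()
a.retain_grad()
(10 - (a @ b).sum()).backward()
retained = a.grad

# Alternative: convert first, then enable requires_grad, so the tensor is a leaf.
c = torch.rand(3, 2).double().requires_grad_()
(10 - (c @ b).sum()).backward()

print(n.is_leaf, missing)         # non-leaf, no grad
print(a.is_leaf, retained.shape)  # non-leaf, grad retained
print(c.is_leaf, c.grad.shape)    # leaf, grad populated
```

The last variant is usually the more convenient one when you only care about the converted tensor, since there is no extra call to remember.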

Here is an example that might help:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 3, 2)  # we created it, so its parameters are leaves that require grad
print(conv.weight.requires_grad)  # True
print(conv.weight.is_leaf)  # True
print(conv.weight.grad)  # None, because backward() has not been called yet
z = conv(torch.randn(1, 1, 10, 10)).sum()  # output of an operation: requires grad but is not a leaf
print(z.is_leaf)  # False
z.backward()
print(conv.weight.grad_fn)  # None, leaves have no grad_fn
print(conv.weight.grad.shape)  # torch.Size([3, 1, 2, 2]), grad is now populated
print(z.grad_fn)  # <SumBackward0 object at 0x...>

Although, after reading a few posts, some questions came up for me too. I think we need to look at some presentations about it (let's google it!)

Thanks for the explanation! It is pretty clear, but I still have some questions because the results look kind of weird to me:

# scenario 1
>>> a = torch.rand(3,2,requires_grad=True)
>>> b = torch.rand(2,3)
>>> loss = 10 - (a @ b).sum()
>>> loss.backward()
>>> a.grad
tensor([[-0.9418, -1.4957],
        [-0.9418, -1.4957],
        [-0.9418, -1.4957]]) # expected, has gradient
>>> b.grad # expected, no gradient
>>>
...
# scenario 2
>>> a = torch.rand(3,2,requires_grad=True).cuda()
>>> b = torch.rand(2,3).cuda()
>>> loss = 10 - (a @ b).sum()
>>> loss.backward()
>>> a.grad # expected since operation `cuda` makes it a non-leaf, but why?
>>> b.grad # expected, no gradient
>>>
...
# scenario 3
>>> ll = nn.Linear(3,3).cuda()
>>> inp = torch.rand(3).cuda()
>>> loss = (10 - ll(inp)).sum()
>>> loss.backward()
>>> ll.weight.grad
tensor([[-0.8497, -0.1128, -0.8081],
        [-0.8497, -0.1128, -0.8081],
        [-0.8497, -0.1128, -0.8081]], device='cuda:0') # why it has gradient?

As you can see in scenario 2, after moving the tensor to cuda I was not able to get the gradient without using a.retain_grad() or device='cuda'. In scenario 3, after moving the layer to cuda, the gradient of the weight is still there. May I know why it is designed this way? I know that the result of an operation like multiplication does not need its gradient to be kept, since the gradients of intermediate values are usually not useful, but why does an operation like cuda() also make a tensor a non-leaf while the weight of a layer stays a leaf?

About scenario 1, we are OK, right?
About case 2, I don't know why it does not accumulate the gradient even though it computes it. The docs literally say: "The fact that gradients need to be computed for a Tensor do not mean that the grad attribute will be populated, see is_leaf for more details." (source)

About case 3: logically, if we want to update the weight and bias, we need the corresponding gradients. As for the code, I am not sure, because we can make scenario 2 behave like scenario 3 by creating the tensor directly on the GPU instead of copying it from the CPU. That way the tensor in case 2 is a leaf and will have grad. Also, weight and bias in nn.Linear are parameters, a specific kind of tensor, so maybe that is what makes this case work.
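A small sketch of the difference, under the assumption that nn.Module.cuda() / .to() / .double() convert the parameters without recording an autograd operation, while the same call on a plain tensor is recorded as an operation. To keep it runnable on CPU, a dtype conversion with .double() is used here in place of the device move.

```python
import torch
import torch.nn as nn

# Plain tensor: the conversion is an autograd op, so the result is a non-leaf.
t = torch.rand(3, requires_grad=True)
t64 = t.double()
print(t64.is_leaf, t64.grad_fn is not None)

# Module: the conversion is applied to the parameters themselves,
# and they remain leaves without a grad_fn.
layer = nn.Linear(3, 3).double()
print(layer.weight.is_leaf, layer.weight.grad_fn)

loss = (10 - layer(torch.rand(3).double())).sum()
loss.backward()
print(layer.weight.grad is not None)  # the leaf parameter gets its grad
```

If that assumption holds, scenario 3 keeps its gradients simply because the weights are still leaves after the move.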

a = torch.rand(3,2,requires_grad=True, device='cuda:0')
b = torch.rand(2,3).cuda()
loss = 10 - (a @ b).sum()
loss.backward()
print(a.is_leaf)
a.grad
# output
True
tensor([[-1.4735, -1.9717],
        [-1.4735, -1.9717],
        [-1.4735, -1.9717]], device='cuda:0')

I really need to study a few things, sorry for my lack of knowledge!

The docs don't really indicate what is going on with is_leaf. In particular, I think the sentence that says "Only leaf Tensors will have their grad populated..." is misleading. From what I can guess, leaves don't really have to do with populating grad; requires_grad is what governs that.

I think the is_leaf property is really about the reverse graph. When x.backward() is called, all of the action happens on the "reverse-mode differentiation" graph at x. x is the root, and the graph runs up along (against the arrows in) the forward graph from x. While only tensors with requires_grad = True appear in the reverse graph, the grad_fn of every tensor visited (even those with requires_grad = False) is used to produce the maps corresponding to the reverse graph's arrows. When the process hits a non-leaf, it knows it can keep mapping along to more nodes. On the other hand, when the process hits a leaf, it knows to stop; leaves have no grad_fn.
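One way to look at this traversal is to walk the backward graph by hand, starting from the grad_fn of the root and following next_functions. The sketch below is only an illustration; walk is a hypothetical helper, not part of PyTorch, and the printed node names may differ between versions.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)  # leaf that requires grad
y = torch.tensor(2.0)                      # leaf that does not require grad
o = (x * y) * 3.0                          # root of the backward graph

def walk(fn, depth=0, seen=None):
    """Hypothetical helper: print the backward graph and collect node names."""
    seen = [] if seen is None else seen
    if fn is None:  # input that does not require grad: nothing to follow
        return seen
    name = type(fn).__name__
    seen.append(name)
    print("  " * depth + name)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, seen)
    return seen

names = walk(o.grad_fn)
print(x.grad_fn, y.grad_fn)  # both leaves have no grad_fn
```

The walk ends at an AccumulateGrad node for x, which is where the gradient gets written into x.grad; y does not show up as a node at all.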

If this is right, it makes it clearer why weights are "leaves with requires_grad = True", and inputs are "leaves with requires_grad = False." You could even take this as a definition of "weights" and "inputs".

Based on PyTorch's design philosophy, is_leaf is not explained because it's not expected to be used by the user unless you have a specific problem that requires knowing if a variable (when using autograd) was created by the user or not.

"If there's a single input to an operation that requires gradient, its output will also require gradient. Conversely, only if all inputs don't require gradient, the output also won't require it. Backward computation is never performed in the subgraphs, where all Tensors didn't require gradients." (Autograd mechanics, PyTorch 1.8.1 documentation)

a = torch.randn(2, 4)
b = torch.randn(4, 2)
c = a.mm(b)
tuple(map(lambda t: t.is_leaf, (a, b, c)))
...
# (True, True, True)

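In the example above, c is still a leaf even though it is the result of an operation, because none of its inputs require grad, so c.requires_grad is False. In the example below, c is not a leaf (is_leaf is False) because it was not "created" directly by the user (it is the result of a and b) and its requires_grad is True.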

a = torch.randn(2, 4).requires_grad_()
b = torch.randn(4, 2)
c = a.mm(b)
tuple(map(lambda t: t.is_leaf, (a, b, c)))
...
# (True, True, False)

I think all your questions have been answered by others except for "but why a tensor's grad is populated when it does not require gradient?"

I assume you mean: why would a tensor's grad be populated when its requires_grad is False?

The answer is that a tensor's grad is not populated when its requires_grad is False.

import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0)
z = x * y

w = torch.tensor(3.0).requires_grad_(False)
o = z * w

o.backward()

# z is a non-leaf, so reading z.grad emits a UserWarning and returns None
for i, name in zip([x, y, z, w], "xyzw"):
    print(f"{name}\ndata: {i.data}\nrequires_grad: {i.requires_grad}\ngrad: {i.grad}\ngrad_fn: {i.grad_fn}\nis_leaf: {i.is_leaf}\n")

In the above example, w has is_leaf True but its requires_grad is False, so its grad is still None after calling .backward().
In other words, requires_grad takes precedence over is_leaf: gradients are not calculated w.r.t. w even though is_leaf is True, because requires_grad is False.