tl;dr
Ensure that:

- `tensor.is_leaf == True`
- `tensor.requires_grad == True`
- `tensor.grad_fn` is `None`; if it is not `None`, you need to call `retain_grad()`
- gradient computation is not disabled using
  - the `torch.no_grad()` context manager
  - `torch.autograd.set_grad_enabled(False)`
- you are not running any non-differentiable operation.

By default, Autograd populates gradients for a tensor `t` in `t.grad` only when `t.is_leaf == True` and `t.requires_grad == True`.
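A minimal sketch of this default behaviour with plain tensors (the names and values are just illustrative):

```python
import torch

x = torch.randn(3, requires_grad=True)  # leaf tensor with requires_grad=True
y = x * 2                               # non-leaf: result of a differentiable op
loss = y.sum()
loss.backward()

print(x.grad)  # populated, e.g. tensor([2., 2., 2.])
print(y.grad)  # None (recent PyTorch versions also emit a UserWarning here)
```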
What is a leaf tensor?
Leaf tensors are tensors at the beginning of the computational graph, which means they are not the outputs of any differentiable operation. A model’s weights and biases, as well as any inputs to it, are all leaf tensors.
Outputs of hidden layers (activations) are not leaf tensors, because they are the result of a differentiable op (e.g. `matmul()`). You can see the operation that generated a tensor in `tensor.grad_fn`.
Read more about leaf tensors: What is the purpose of `is_leaf`?
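A small sketch illustrating this with a throwaway `nn.Linear` (the exact `grad_fn` name may differ across PyTorch versions):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
x = torch.randn(1, 4)
out = model(x)               # activation produced by a differentiable op

print(model.weight.is_leaf)  # True  -- parameters are leaf tensors
print(model.weight.grad_fn)  # None
print(x.is_leaf)             # True  -- inputs are leaf tensors too
print(out.is_leaf)           # False
print(out.grad_fn)           # something like <AddmmBackward0 object at ...>
```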
But I need the gradients of intermediate outputs!
If a tensor is created from an operation that’s “differentiable” by Autograd - including operations like `.to()` which don’t look differentiable - it is not a leaf tensor and will not have gradients accumulated by default.
You can explicitly instruct Autograd to accumulate gradients for such tensors by calling `tensor.retain_grad()` before calling `.backward()`. See this thread for an example: Method grad returns None for a tensor
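A minimal sketch of the pattern, again using a throwaway `nn.Linear`:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
x = torch.randn(1, 4)

hidden = model(x)         # intermediate (non-leaf) output
hidden.retain_grad()      # tell Autograd to populate hidden.grad
loss = hidden.sum()
loss.backward()

print(hidden.grad)        # now populated instead of None
print(model.weight.grad)  # leaf parameters get gradients as usual
```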
!! Gotcha - Avoid using `to()` on `nn.Parameter`s, as they will be deregistered from the model
You will end up overwriting your leaf `nn.Parameter` with a non-leaf tensor.
```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # .to() returns a new, non-leaf tensor, so the nn.Parameter
        # created here is never registered with the module
        self.param = nn.Parameter(torch.randn(1)).to(torch.float64)

model = MyModel()
print(dict(model.named_parameters()))  # {} -- empty
```
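One way to avoid this, assuming you only need a different dtype, is to create the tensor in that dtype before wrapping it in `nn.Parameter` (or convert the whole module after construction, e.g. `model.to(torch.float64)`, which converts registered parameters in place):

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # create the tensor in the desired dtype first, then wrap it
        self.param = nn.Parameter(torch.randn(1, dtype=torch.float64))

model = MyModel()
print(dict(model.named_parameters()))  # {'param': Parameter containing: ...}
print(model.param.is_leaf)             # True
```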
Non-Differentiable operations
The output of non-differentiable operations will have `requires_grad=False` even if the inputs have `requires_grad=True`. Gradients cannot be computed through such operations, and calling `.backward()` will raise `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`. See this thread for an example: Custom loss function: gradients are None
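A small sketch of how this looks in practice, using `torch.argmax()` as an example of a non-differentiable operation:

```python
import torch

x = torch.randn(3, requires_grad=True)
idx = torch.argmax(x)     # argmax is non-differentiable

print(idx.requires_grad)  # False, even though x.requires_grad is True
print(idx.grad_fn)        # None

loss = idx.float()
# loss.backward()  # would raise: RuntimeError: element 0 of tensors does not
#                  # require grad and does not have a grad_fn
```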