tl;dr
Ensure that:

- `tensor.is_leaf == True`
- `tensor.requires_grad == True`
- `tensor.grad_fn` is `None`; if it is not `None`, you need to call `retain_grad()`
- gradient computation is not disabled using
  - the `torch.no_grad()` context manager
  - `torch.autograd.set_grad_enabled(False)`
- you are not running any non-differentiable operation
By default, Autograd populates gradients for a tensor `t` in `t.grad` only when `t.is_leaf == True` and `t.requires_grad == True`.
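A minimal sketch of this default behavior (the names are illustrative, not from any thread linked below):

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf tensor created directly by the user
h = w * 2                               # non-leaf: produced by a differentiable op
loss = h.sum()
loss.backward()

print(w.is_leaf, w.requires_grad)  # True True
print(w.grad)                      # tensor([2., 2., 2.])  (populated)
print(h.is_leaf, h.grad)           # False None  (PyTorch also emits a UserWarning here)
```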
What is a leaf tensor?
Leaf tensors are tensors at the beginning of the computational graph, which means they are not the outputs of any differentiable operation. A model’s weights and biases, as well as any inputs to it, are all leaf tensors.
Outputs of hidden layers (activations) are not leaf tensors, because they are the result of a differentiable op (e.g. `matmul()`). You can see the operation that generated a tensor in `tensor.grad_fn`.
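For instance (a small illustrative snippet, not taken from the linked thread):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.randn(1, 4)      # input: a leaf tensor, no grad_fn
h = layer(x)               # activation: created by a differentiable op

print(x.is_leaf, x.grad_fn)   # True None
print(h.is_leaf, h.grad_fn)   # False <AddmmBackward0 object at 0x...>
print(layer.weight.is_leaf)   # True (parameters are leaf tensors)
```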
Read more about leaf tensors: What is the purpose of `is_leaf`?
But I need the gradients of intermediate outputs!
If a tensor is created from an operation that’s “differentiable” by Autograd (including operations like `.to()`, which don’t look differentiable), it is not a leaf tensor and will not have gradients accumulated by default.
You can explicitly instruct Autograd to accumulate gradients for a tensor by calling `tensor.retain_grad()` before calling `.backward()`. See this thread for an example: Method grad returns None for a tensor
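A minimal sketch of `retain_grad()` in use (toy tensors, not the code from that thread):

```python
import torch

w = torch.randn(3, requires_grad=True)
h = w * 2                 # intermediate, non-leaf tensor
h.retain_grad()           # ask Autograd to keep h's gradient after backward()
loss = (h ** 2).sum()
loss.backward()

print(h.grad)             # d(loss)/dh = 2 * h, now populated
print(w.grad)             # leaf gradients are populated as usual
```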
!! Gotcha - Avoid calling `.to()` on `nn.Parameter`s, as they will be deregistered from the model

You will end up overwriting your leaf `nn.Parameter` with a non-leaf tensor.
```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # .to() returns a plain, non-leaf Tensor, so the nn.Parameter is never registered
        self.param = nn.Parameter(torch.randn(1)).to(torch.float64)

model = MyModel()
print(dict(model.named_parameters()))  # {} (empty: the parameter was not registered)
```
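Continuing the snippet above, one way to keep the parameter registered is to cast the tensor before wrapping it in `nn.Parameter` (a minimal sketch, assuming you want a float64 parameter; `FixedModel` is an illustrative name):

```python
class FixedModel(nn.Module):
    def __init__(self):
        super(FixedModel, self).__init__()
        # cast the plain tensor first, then wrap it in nn.Parameter
        self.param = nn.Parameter(torch.randn(1, dtype=torch.float64))

model = FixedModel()
print(dict(model.named_parameters()))  # {'param': Parameter containing: tensor([...])}
```

Alternatively, keep the parameter in the default dtype and call `model.double()` after construction, which casts the registered parameters in place.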
Non-differentiable operations
The output of a non-differentiable operation will have `requires_grad=False` even if its inputs have `requires_grad=True`. Gradients cannot be computed through such an operation, and calling `.backward()` downstream raises `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`. See this thread for an example: Custom loss function: gradients are None
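For instance, `argmax()` is non-differentiable, so anything computed only from its output is cut off from the graph (an illustrative sketch, not the code from that thread):

```python
import torch

logits = torch.randn(4, requires_grad=True)
pred = logits.argmax()       # non-differentiable op

print(pred.requires_grad)    # False
print(pred.grad_fn)          # None

loss = pred.float()          # a "loss" built only from pred has no grad_fn
loss.backward()              # RuntimeError: element 0 of tensors does not require grad ...
```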