tl;dr
Ensure that:
- tensor.is_leaf == True
- tensor.requires_grad == True
- tensor.grad_fn is None; if it is not None, you need to call retain_grad()
- gradient computation is not disabled via the torch.no_grad() context manager or torch.autograd.set_grad_enabled(False)
- you are not running any non-differentiable operation

By default, Autograd populates gradients for a tensor t in t.grad only when t.is_leaf == True and t.requires_grad == True.
What is a leaf tensor?
Leaf tensors are tensors at the beginning of the computational graph, which means they are not the outputs of any differentiable operation. A model’s weights and biases, as well as any inputs to it, are all leaf tensors.
Outputs of hidden layers (activations) are not leaf tensors, because they are the result of a differentiable op (e.g. matmul()). You can see the operation that generated a tensor in tensor.grad_fn.
Read more about leaf tensors: What is the purpose of `is_leaf`?
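A minimal sketch of the distinction (the tensor names here are illustrative):

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf: created directly, not by an op
x = torch.randn(3)                      # also a leaf, but requires_grad=False
h = w * x                               # non-leaf: output of a differentiable op

print(w.is_leaf, w.grad_fn)  # True None
print(h.is_leaf, h.grad_fn)  # False <MulBackward0 object at ...>

h.sum().backward()
print(w.grad)  # populated: leaf with requires_grad=True
print(h.grad)  # None (with a warning): non-leaf, grads not retained by default
```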
But I need the gradients of intermediate outputs!
If a tensor is created by an operation that Autograd considers differentiable (including operations like .to(), which don't look differentiable), it is not a leaf tensor and will not have gradients accumulated by default. You can explicitly instruct Autograd to accumulate gradients for such a tensor by calling tensor.retain_grad() before calling .backward(). See this thread for an example: Method grad returns None for a tensor
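A short sketch of retain_grad() on an intermediate tensor (the variable names are illustrative):

```python
import torch

w = torch.randn(3, requires_grad=True)
h = w * 2         # non-leaf intermediate result
h.retain_grad()   # must be called before backward()
loss = h.sum()
loss.backward()

print(h.grad)  # tensor([1., 1., 1.]) -- now populated
print(w.grad)  # tensor([2., 2., 2.]) -- leaf grads as usual
```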
!! Gotcha: Avoid calling to() on an nn.Parameter inside __init__, as the parameter will be deregistered from the model. You will end up overwriting your leaf nn.Parameter with a non-leaf tensor.
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # .to() returns a new, non-leaf tensor, so the nn.Parameter is never registered
        self.param = nn.Parameter(torch.randn(1)).to(torch.float64)

model = MyModel()
print(dict(model.named_parameters()))  # {} -- empty
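One way to avoid this (a sketch, using the same hypothetical MyModel): convert the data before wrapping it in nn.Parameter, so the attribute assigned in __init__ is still a Parameter and gets registered.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Convert first, then wrap: the attribute stays an nn.Parameter
        self.param = nn.Parameter(torch.randn(1).to(torch.float64))

model = MyModel()
print(dict(model.named_parameters()))  # {'param': Parameter containing: ...}
```

Alternatively, build the model with default dtype and convert the whole module afterwards, e.g. model.double(), which converts registered parameters in place.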
Non-differentiable operations
The output of a non-differentiable operation will have requires_grad=False even if its inputs have requires_grad=True. Gradients cannot be computed through such an operation, and you will see the error RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. See this thread for an example: Custom loss function: gradients are None
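A small sketch of this failure mode, using torch.argmax() as an example of a non-differentiable op:

```python
import torch

x = torch.randn(5, requires_grad=True)
y = torch.argmax(x)     # argmax is non-differentiable
print(y.requires_grad)  # False, even though x.requires_grad is True
print(y.grad_fn)        # None

try:
    y.float().backward()
except RuntimeError as e:
    print(e)  # element 0 of tensors does not require grad and does not have a grad_fn
```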