Why are my tensor's gradients unexpectedly None or not None?

tl;dr

Ensure that

  • tensor.is_leaf == True
  • tensor.requires_grad == True
  • tensor.grad_fn is None; if it is not None, the tensor is a non-leaf and you need to call tensor.retain_grad() before backward()
  • gradient computation is not disabled using
    • torch.no_grad() context manager
    • torch.autograd.set_grad_enabled(False)
  • you are not running any non-differentiable operation.
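
As a quick sanity check for the "gradient computation is disabled" items, here is a minimal sketch (the variable names are made up for illustration) showing that results computed while grad mode is off are not tracked by Autograd:

```python
import torch

w = torch.ones(3, requires_grad=True)

# Normal mode: the result is tracked and has a grad_fn.
tracked = (w * 2).sum()

# Inside torch.no_grad(), the result is not attached to the graph.
with torch.no_grad():
    untracked_ctx = (w * 2).sum()

# set_grad_enabled(False) can also be used as a context manager.
with torch.autograd.set_grad_enabled(False):
    untracked_flag = (w * 2).sum()

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked_ctx.requires_grad, untracked_ctx.grad_fn)
print(untracked_flag.requires_grad, untracked_flag.grad_fn)

# Grad mode is restored once the context managers exit.
print(torch.is_grad_enabled())
```

If a tensor in your own code unexpectedly shows requires_grad=False, it is worth checking whether it was produced inside one of these blocks.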

By default, Autograd populates gradients for a tensor t in t.grad only when t.is_leaf == True and t.requires_grad == True.

What is a leaf tensor?

Leaf tensors are tensors at the beginning of the computational graph, which means they are not the outputs of any differentiable operation. A model’s weights and biases, as well as any inputs to it, are all leaf tensors.

Outputs of hidden layers (activations) are not leaf tensors, because they are the result of a differentiable op (e.g. matmul()). You can see the operation that generated a tensor in tensor.grad_fn.
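
To make the leaf / non-leaf distinction concrete, here is a small sketch with a hand-rolled "layer" (the shapes and names are arbitrary, chosen only for illustration). It inspects is_leaf and grad_fn for an input, a weight, and an activation:

```python
import torch

x = torch.randn(4, 3)                      # input, requires_grad=False
w = torch.randn(3, 2, requires_grad=True)  # weight created directly by the user
h = x @ w                                  # activation: output of matmul
loss = h.sum()
loss.backward()

print(x.is_leaf, x.grad_fn)  # leaf, no grad_fn
print(w.is_leaf, w.grad_fn)  # leaf, no grad_fn
print(h.is_leaf, h.grad_fn)  # non-leaf, grad_fn records the matmul

# Only the leaf that requires grad has its .grad populated.
print(w.grad.shape)
print(x.grad)
```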

Read more about leaf tensors: What is the purpose of `is_leaf`?

But I need the gradients of intermediate outputs!

If a tensor is created by an operation that Autograd treats as differentiable, including operations like .to() that don't look differentiable, it is not a leaf tensor and will not have gradients accumulated in .grad by default.

You can explicitly instruct Autograd to keep gradients for non-leaf tensors by calling tensor.retain_grad() on them before calling .backward(). See this thread for an example: Method grad returns None for a tensor
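
A minimal sketch of retain_grad() on an intermediate tensor (the tensors here are illustrative, not taken from the linked thread):

```python
import torch

x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)

h = x @ w          # non-leaf: h.grad would stay None by default
h.retain_grad()    # ask Autograd to keep the gradient for h

loss = (h ** 2).sum()
loss.backward()

# d(loss)/dh = 2 * h, so the retained gradient has the same shape as h.
print(h.grad.shape)
print(torch.allclose(h.grad, 2 * h.detach()))
```

Note that retain_grad() has to be called before backward(); calling it afterwards does not recover a gradient from a pass that has already run.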


!! Gotcha - Avoid using to() for nn.Parameters as they will be deregistered from the model

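When .to() actually changes the dtype or device, it returns a plain non-leaf tensor rather than an nn.Parameter. Assigning that result to a module attribute replaces your leaf nn.Parameter with a non-leaf tensor, so nothing gets registered with the model.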

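import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Bug: .to() returns a non-leaf tensor, so no nn.Parameter is registered.
        self.param = nn.Parameter(torch.randn(1)).to(torch.float64)

model = MyModel()
print(dict(model.named_parameters()))  # {} (empty)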
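
One way to avoid the gotcha above is sketched below. The class names FixedModel and Float32Model are hypothetical. The idea is either to create the tensor with the target dtype before wrapping it in nn.Parameter, or to convert the whole module with model.to() after construction:

```python
import torch
import torch.nn as nn

class FixedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Build the tensor in the target dtype first, then wrap it.
        self.param = nn.Parameter(torch.randn(1, dtype=torch.float64))

class Float32Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.param = nn.Parameter(torch.randn(1))

model = FixedModel()
names = [name for name, _ in model.named_parameters()]
print(names, model.param.is_leaf, model.param.dtype)

# Alternative: cast the module itself; its parameters stay registered.
cast_model = Float32Model().to(torch.float64)
cast_names = [name for name, _ in cast_model.named_parameters()]
print(cast_names, cast_model.param.is_leaf, cast_model.param.dtype)
```

Both variants keep the parameter visible to named_parameters(), and therefore to an optimizer built from model.parameters().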

Non-Differentiable operations

The output of a non-differentiable operation has requires_grad=False even if its inputs have requires_grad=True. Gradients cannot flow through such an operation, and calling .backward() on a result that depends only on its output raises RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. See this thread for an example: Custom loss function: gradients are None
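
Here is a small sketch that reproduces this, using torch.argmax as the non-differentiable operation (my own choice of example, not taken from the linked thread):

```python
import torch

logits = torch.randn(5, requires_grad=True)

# argmax returns an integer index; Autograd cannot differentiate through it.
idx = torch.argmax(logits)
print(idx.requires_grad)

message = ""
try:
    # Nothing upstream of this value is tracked, so backward() fails.
    idx.float().backward()
except RuntimeError as err:
    message = str(err)

print(message)
```

If you hit this error in a custom loss, printing requires_grad on each intermediate value is usually the quickest way to find the op that cut the graph.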