Stuck on extracting grad() of loss

Using the C++ frontend, during training I compute a loss tensor with a construction similar to

auto x = module->forward(inputs);
auto loss = loss_function->compute(x, targets);
if (loss.requires_grad()) { // tests true
    loss.backward();
}

This is being done in a closure passed to an optimizer; this closure then returns the loss.

I want to preserve the gradient at this point for diagnostic purposes, so I attempt to extract the gradient of the loss using something similar to

grads = loss.grad().clone(); // with or without the clone()

In all cases, this seems to resolve to a gdb description of grads of the form

{impl_ = {target_ = some_address c10::UndefinedTensorImpl::_singleton}}

Any time I try to use this, I get a throw of the form

Expected a Tensor of type Variable but found an undefined Tensor…

Note that if I don’t try to grab the gradient at this or any point, the optimization proceeds just fine, so the necessary gradients are indeed being made available to the optimizer.
I am probably missing something about the semantics and/or proper use of the .grad() method. My understanding is that, after .backward(), this should return a “mutable reference” to the gradient of the tensor with respect to all the weights. (I’m hoping this curious phrasing doesn’t imply that it is “write only” and can only be used as the target of assignments.)
There is a const Tensor& grad() method prototype as well, which I have also tried and used in a successful compilation; the description of the action of this in the source is a little bit ambiguous. The behavior is essentially the same.
This has probably come up before, but what is the “standard” approach to grabbing loss gradients for future inspection?

Thanks,
Eric

Hi,

A few things:

  • Undefined Tensors are used in C++ to efficiently represent gradients full of zeros. So it is possible that you get an undefined Tensor here while your optimization is still valid.
  • Just like in Python, by default the .grad() attribute is only populated for leaf Tensors; otherwise you will get an undefined Tensor. Is this Tensor a leaf? If not, you need to call .retain_grad() on it before calling .backward() to have its grad attribute populated (see the sketch after this list).
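To make the second point concrete, here is a minimal standalone sketch. It assumes a reasonably recent libtorch in which Tensor::retain_grad() is exposed in the C++ API; the shapes and operations are made up for illustration:

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      // w is a leaf: it requires grad and has no gradient history.
      auto w = torch::randn({3}, torch::requires_grad());
      // loss is a non-leaf: it is the result of differentiable ops on w.
      auto loss = (w * w).sum();

      loss.retain_grad(); // ask autograd to keep a .grad() for this non-leaf
      loss.backward();

      // Check .defined() before using a gradient: an undefined Tensor is
      // how the C++ API represents "no gradient stored".
      if (loss.grad().defined()) {
        std::cout << "loss.grad() = " << loss.grad() << '\n';
      }
      std::cout << "w.grad() = " << w.grad() << '\n'; // dloss/dw = 2 * w
    }

Without the retain_grad() call, loss.grad() here would be exactly the undefined Tensor you are seeing.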

Thanks for the quick response. Maybe I'm missing something, but I assumed that, since the loss Tensor apparently carries gradient information usable by the optimizer (populated by .backward()), the .grad() attribute would be populated automatically. I also assumed that, since the loss is the output of a full .forward() operation, it was a leaf; but that could easily be because I'm not completely grasping the leaf concept.
In case any of the above assumptions are invalid, I can certainly call .retain_grad() on the loss before .backward(). I also (perhaps naively) assumed that .retain_grad() follows whenever .requires_grad() tests true. I generally dig through the library source to try to clarify these things, but the autograd facility is a bit complex.

I think you're seeing leaves the wrong way. A leaf Tensor is a Tensor that has no gradient history. In particular, a Parameter in your net is a Tensor with no history, and so it is a leaf: one of the Tensors you are currently learning.

The loss is at the complete other end. Since you call .backward() on it, the gradient of any quantity foo will be dloss/dfoo, and so the gradient you would see for the loss itself is always dloss/dloss = 1.
The interesting gradients are the ones on the leaves: for a weight w, dloss/dw shows you the direction that weight should move to reduce the loss.
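To make that concrete, here is a small sketch; torch::nn::Linear and torch::mse_loss are just stand-ins for your own module and loss function:

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      torch::nn::Linear net(4, 1); // its weight and bias are leaves
      auto inputs  = torch::randn({8, 4});
      auto targets = torch::randn({8, 1});

      auto loss = torch::mse_loss(net->forward(inputs), targets);
      loss.retain_grad(); // only needed for the non-leaf loss
      loss.backward();

      // dloss/dloss: always a scalar 1.
      std::cout << "dloss/dloss = " << loss.grad().item<float>() << '\n';

      // The diagnostically useful gradients live on the leaf parameters.
      for (const auto& p : net->named_parameters()) {
        std::cout << p.key() << " grad:\n" << p.value().grad() << '\n';
      }
    }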

Spot on, thanks. Invoking .retain_grad() on the loss does indeed create a single-element Tensor with value 1. I had some vestige of the leaf concept from graph theory (i.e. no children) in mind, which was messing me up; or it is valid, but my sense of "parenthood direction" is backwards.
Most definitely I am interested in dloss/dw, because it helps me visualize why an optimization might be getting stuck. So how would I get at that? Do I actually pull it out of the back end (i.e. the weights) after .backward() is performed?
Excuse my naivete on this, but resolving it will go a long way toward calibrating my intuition. Thanks again.

Yes.

So maybe it’s simpler to consider the autograd.grad(loss, w) API first:
Given the output and input of a function, it will return the gradient dloss/dw to you.

In the context of torch.nn, to spare the user the burden of moving these gradients around, we use loss.backward(), which populates the .grad field on the leaves (w here). So after loss.backward(), you can access w.grad() to get dloss/dw.
This .grad field is then used by the optimizer when it performs its step.
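Both routes exist in the C++ frontend. A minimal sketch, assuming the torch::autograd::grad overload that takes lists of output and input tensors:

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      auto w = torch::randn({3}, torch::requires_grad());

      // Route 1: autograd::grad hands dloss/dw back to you directly,
      // without writing anything into w.grad().
      auto loss = (w * w).sum();
      auto grads = torch::autograd::grad(/*outputs=*/{loss}, /*inputs=*/{w});
      std::cout << "dloss/dw via autograd::grad:\n" << grads[0] << '\n';

      // Route 2: backward() populates the .grad() field on the leaf w,
      // which is what the optimizer reads during its step.
      auto loss2 = (w * w).sum(); // rebuild the graph; grad() freed it above
      loss2.backward();
      std::cout << "dloss/dw via w.grad():\n" << w.grad() << '\n';
    }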