Autograd Python userspace code relies on internal state

Hi,

I have a simple beginner question about autograd.
Thanks for the nice library.
I really like the functionality, and I understand that it simplifies a lot of code by packaging the derivative with respect to a variable directly in the variable.

What I am fearing is that the .backward() => .grad mechanic relies on the user keeping in mind that there is mutable internal state, .grad, which depends on what code was executed previously, and that tends to hurt readability.

Let’s assume I write this:

>>> w = torch.tensor([1., 1., 1.], requires_grad=True)
>>> func = w[0] ** 2 + 2 * w[1] + w[2]   # The computation graph for the loss is built
>>> func.backward()   # the derivatives of func with respect to each tensor declared with requires_grad=True are computed and stored in their respective .grad fields
>>> w.grad   # I can get the derivatives, handy
tensor([2., 2., 1.])

Now this means I should not write too much code between the .backward() and the .grad, because otherwise it becomes unclear which function the gradient belongs to.
When I see a .grad in a codebase, I do not immediately know which function was used to generate it.

>>> w = torch.tensor([1., 1., 1.], requires_grad=True)
>>> func = w[0] ** 2 + 2 * w[1] + w[2]
>>> func.backward()
>>> oups_had_forgotten = func + 3 * w[1]
>>> ... some code ...
>>> w.grad
tensor([2., 2., 1.])

If I have not been careful with my .backward() calls, is this the gradient of func or of oups_had_forgotten here?
Also, if I want to compute another derivative, I first need to .zero_() the .grad tensor, which is another hard-to-read step.
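
For example, here is a minimal illustration of the accumulation behaviour (each .backward() below runs on its own fresh graph):

>>> w = torch.tensor([1., 1., 1.], requires_grad=True)
>>> w.sum().backward()
>>> w.grad
tensor([1., 1., 1.])
>>> (2 * w.sum()).backward()   # without zeroing first, gradients accumulate
>>> w.grad
tensor([3., 3., 3.])
>>> w.grad.zero_()   # required before computing the next, independent derivative
tensor([0., 0., 0.])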

I feel a more functional approach (in appearance only; the backend could still behave as it does now) would make it easier for users to learn by trial and error.
Consider something like:

>>> w.grad(func)
tensor([2., 2., 1.])
>>> w.grad(oups_had_forgotten)
tensor([2., 5., 1.])

This functional .grad would have the advantage of being explicit, and it would not store the result in the tensor in place.
So there would be no need for .backward(), and no need for .zero_() to clean up the user-exposed state.
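
Such a functional .grad could even be a thin wrapper over the current mechanic. A rough sketch (grad_of is a hypothetical helper, not an existing API):

>>> def grad_of(output, inp):
...     # hypothetical helper: backward into .grad, copy the result out,
...     # then reset the exposed state so nothing mutable leaks to the user
...     output.backward()
...     g = inp.grad.clone()
...     inp.grad.zero_()
...     return g
...
>>> w = torch.tensor([1., 1., 1.], requires_grad=True)
>>> func = w[0] ** 2 + 2 * w[1] + w[2]
>>> grad_of(func, w)
tensor([2., 2., 1.])
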
Is there a function exposing such a behavior today? Would there be big downsides in terms of usability/performance?

Thank you

Hi,

The .backward() and .grad fields are built to work nicely with the torch.nn and torch.optim libraries.
They let you compute the gradients and update all the parameters without having to pass all the gradients around yourself.
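
For example, a minimal sketch of the usual training loop (the model, data, and learning rate are just placeholders):

>>> model = torch.nn.Linear(3, 1)
>>> opt = torch.optim.SGD(model.parameters(), lr=0.1)
>>> x, y = torch.randn(4, 3), torch.randn(4, 1)
>>> loss = torch.nn.functional.mse_loss(model(x), y)
>>> opt.zero_grad()   # clear the .grad fields from the previous step
>>> loss.backward()   # populate .grad on every parameter of the model
>>> opt.step()        # the optimizer reads the .grad fields in place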

If you are not working with these libraries, I would recommend using torch.autograd.grad(out, inp), which will return the gradient for each input you give.
This should do what you want here, right?

Thanks for your reply @albanD!
Indeed, you are correct: this is the function I had in mind.
From reading the documentation I did not clearly understand its purpose (maybe an example would have helped).

I just tried to apply my mental flow with it, and it works. The defaults do not make trial and error easy, though, since they do not retain the graph:

>>> w = torch.tensor([1., 1., 1.], requires_grad=True)
>>> func = w[0] ** 2 + 2 * w[1] + w[2]
>>> oups_had_forgotten = func + 3 * w[1]
>>> torch.autograd.grad(oups_had_forgotten, w)
(tensor([2., 5., 1.]),)
>>> torch.autograd.grad(func, w)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\remi\.virtualenvs\zerotogan--y2dYRx8\lib\site-packages\torch\autograd\__init__.py", line 158, in grad
    inputs, allow_unused)
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Is the aggressive freeing of buffers a performance tweak?

But if I choose to retain the graph at each step, then I can at least retry:

>>> w = torch.tensor([1., 1., 1.], requires_grad=True)
>>> func = w[0] ** 2 + 2 * w[1] + w[2]
>>> oups_had_forgotten = func + 3 * w[1]
>>> torch.autograd.grad(func, w, retain_graph=True)
(tensor([2., 2., 1.]),)
>>> torch.autograd.grad(oups_had_forgotten, w, retain_graph=True)
(tensor([2., 5., 1.]),)

Thanks!

Indeed, you are correct: this is the function I had in mind.

Perfect!

Is the aggressive freeing of buffers a performance tweak?

It is a very important memory improvement, yes.
In a neural network, most of the memory usage comes from these buffers, so freeing them as the backward pass proceeds significantly reduces the peak memory usage.
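
For instance, a rough way to observe the effect (hypothetical sizes, and it assumes a CUDA device is available):

>>> x = torch.randn(1000, 1000, device="cuda", requires_grad=True)
>>> y = x
>>> for _ in range(50):
...     y = y.sigmoid()   # each op saves a buffer for its backward
...
>>> before = torch.cuda.memory_allocated()
>>> y.sum().backward()    # default behaviour: buffers are freed as backward runs
>>> torch.cuda.memory_allocated() < before   # the saved buffers are gone
True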
