It seems like both functions can be used to compute gradients. In fact, in their source code they call the same function, Variable._execution_engine.run_backward().
So what is the difference between these functions in terms of use cases? I would assume autograd.grad is more general?
.backward() is very convenient when working with torch.nn because it does not require you to specify which Tensors you want the gradients for (and the optimizer knows where to look for these gradients). .grad() requires you to specify the Tensors you want the gradients for and won’t populate the .grad field.
You could re-implement one with the other but they are just slightly different APIs for different use cases.
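A minimal sketch of the contrast between the two APIs on a toy computation (the tensors here are illustrative, not from the thread):

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()

# autograd.grad returns the gradients directly; w.grad stays None.
(g,) = torch.autograd.grad(loss, w)
print(g)        # tensor([4., 6.])
print(w.grad)   # None

# .backward() returns nothing and instead populates the .grad field,
# which is where optimizers like torch.optim.SGD look for gradients.
loss2 = (w ** 2).sum()
loss2.backward()
print(w.grad)   # tensor([4., 6.])
```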
If the .grad fields are not zeroed out before you call .backward(), you can end up with different values.
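This accumulation behavior can be seen in a short sketch: .backward() adds into .grad on every call, while autograd.grad always returns fresh values.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 2).sum().backward()
print(w.grad)   # tensor([2.])

# Without zeroing (e.g. optimizer.zero_grad()), the second call
# accumulates into the existing .grad field.
(w * 2).sum().backward()
print(w.grad)   # tensor([4.])

# autograd.grad is unaffected by the stale .grad: it returns
# the gradient of this call only.
(g,) = torch.autograd.grad((w * 2).sum(), w)
print(g)        # tensor([2.])
```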
Also, what do you mean by “slightly different” exactly?
I figured that the difference was having create_graph set to True while using .backward(), and set to False while using .grad().
My understanding of create_graph is that it enables differentiation through the gradient computation step. I don’t understand why this should affect the values of the computed gradients; any help clarifying this would be much appreciated. Thank you.
Again, it depends how big the difference is. If the difference is at the level of numerical precision (or one or two orders of magnitude larger if you test a full network), then this is because create_graph forces us to use a backward implementation that is differentiable.
And because floating point arithmetic is not associative, using different implementations can lead to bit-wise differences in the result. These differences are then amplified by the later computations in the backward pass.
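For context, what create_graph=True buys you is the ability to differentiate through the backward pass itself (double backward). A small sketch with a scalar toy function:

```python
import torch

# create_graph=True records the backward pass in the autograd graph,
# so the resulting gradient can itself be differentiated.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 3                                  # y = x^3

(g,) = torch.autograd.grad(y, x, create_graph=True)
print(g)    # dy/dx = 3*x^2 = 27

# Second derivative, only possible because g carries a graph.
(g2,) = torch.autograd.grad(g, x)
print(g2)   # d2y/dx2 = 6*x = 18
```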
@albanD Thank you for helping me out. Would you mind if I share a GitHub link with a minimal example?
I am using a model where the first layer is the BERT embedding layer, attached to a classifier. The gradients for the classifier, however, match with and without create_graph.
It’s easy to run (python test.py) and doesn’t call any weird libraries, haha!
The pre-trained model is also uploaded in the repo, and it’s essential for seeing the discrepancy: for randomly initialized models we don’t see it.
The whole thing takes a minute to run! All the relevant action takes place between lines 146 and 170; the rest is just setting up the BERT model.
Thanks a lot for agreeing to take a look at the code.
After doing the change to run on CPU (I don’t have a GPU on my machine), I can’t seem to reproduce this:
when create_graph is True
loss :18.81757354736328
/data/users/albandes/pytorch/3.6_debug_source/torch/functional.py:1242: UserWarning: torch.norm is deprecated and may be removed in a future PyTorch release. Use torch.linalg.norm instead.
"torch.norm is deprecated and may be removed in a future PyTorch release. "
classifier grad norm :394.94049072265625
bert embedding grad norm :20.38188934326172
When create_graph is False
loss :18.81757354736328
classifier grad norm :394.94049072265625
bert embedding grad norm :20.38188934326172
So my guess is that some of the CUDA libraries behave differently between the two paths and lead to different numerical values.
Can you try to:
disable cudnn: torch.backends.cudnn.enabled=False
increase the precision to double: input = input.double() and net.double()
test on cpu (remove the .cuda() and add a map_location=torch.device('cpu') to the model loading)
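The three isolation steps above can be sketched roughly as follows (names like net, input, and the checkpoint path are placeholders, not from the actual test.py):

```python
import torch

torch.manual_seed(0)

# 1) Rule out cudnn's algorithm selection as a source of differences.
torch.backends.cudnn.enabled = False

# 2) Increase precision to double so implementation differences
#    shrink toward float64 round-off.
net = torch.nn.Linear(4, 2).double()
input = torch.randn(1, 4, dtype=torch.float64)

# 3) Stay on CPU; when loading a checkpoint that was saved on GPU,
#    remap it, e.g.:
# state = torch.load("model.pt", map_location=torch.device("cpu"))

net(input).sum().backward()
print(net.weight.grad.norm())
```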
But I ran the code on CPU and made the changes you asked me to make, yet I still got different results depending on create_graph.
I also fixed the random seed with torch.manual_seed(0).
when create_graph is True
loss :15.955138471719597
classifier grad norm :394.79707122347384
bert embedding grad norm :23.128852966924228
When create_graph is False
loss :15.955138471719597
classifier grad norm :394.79707122347384
bert embedding grad norm :19.880596727123276
I have pushed a CPU-runnable version of my code to the git repo; could you run that, please?
Could the difference be due to a different version of PyTorch?
The torch-related libraries I am using: torch==1.6.0, torchfile==0.1.0, torchvision==0.7.0.
I can’t reproduce on master or with the nightly builds:
[albandes ~/tmp/create_graph_debug] git pull
Updating 15a5a85..169926d
Fast-forward
test.py | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
[albandes@devvm138.atn0 ~/tmp/create_graph_debug] . ../../local/pytorch/3.6_debug_source_env/bin/activate
(3.6_debug_source_env) [albandes ~/tmp/create_graph_debug] python test.py
when create_graph is True
loss :15.955138471719597
/data/users/albandes/pytorch/3.6_debug_source/torch/functional.py:1242: UserWarning: torch.norm is deprecated and may be removed in a future PyTorch release. Use torch.linalg.norm instead.
"torch.norm is deprecated and may be removed in a future PyTorch release. "
classifier grad norm :394.79707122347384
bert embedding grad norm :19.880596727129774
When create_graph is False
loss :15.955138471719597
classifier grad norm :394.79707122347384
bert embedding grad norm :19.880596727129774
But I can on 1.6:
(3.6_release_binary_env) [albandes@devvm138.atn0 ~/tmp/create_graph_debug] python test.py
when create_graph is True
loss :15.955138471719597
classifier grad norm :394.7970712234737
bert embedding grad norm :23.12885296693177
When create_graph is False
loss :15.955138471719597
classifier grad norm :394.7970712234737
bert embedding grad norm :19.88059672712989