It seems like both functions can be used to compute gradients. In fact, in their source code they call the same function, Variable._execution_engine.run_backward().
So what is the difference between these functions in terms of use cases? I would assume autograd.grad is more general?
.backward() is very convenient when working with torch.nn because it does not require you to specify which Tensors you want the gradients for (and the optimizer knows where to look for these gradients). .grad() requires you to specify the Tensors you want the gradients for and won’t populate the .grad field.
You could re-implement one with the other but they are just slightly different APIs for different use cases.
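A minimal sketch of the contrast between the two APIs on a toy computation (the tensors here are illustrative, not from the thread):

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()

# autograd.grad returns the gradients directly; w.grad stays None.
(g,) = torch.autograd.grad(loss, w)
print(g)        # tensor([4., 6.])
print(w.grad)   # None

# .backward() returns nothing and instead populates the .grad field,
# which is where optimizers like torch.optim.SGD look for gradients.
loss2 = (w ** 2).sum()
loss2.backward()
print(w.grad)   # tensor([4., 6.])
```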
If the .grad fields are not zeroed out before you call .backward(), you can end up with different values.
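This accumulation behavior can be seen in a short sketch: .backward() adds into .grad on every call, while autograd.grad always returns fresh values.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 2).sum().backward()
print(w.grad)   # tensor([2.])

# Without zeroing (e.g. optimizer.zero_grad()), the second call
# accumulates into the existing .grad field.
(w * 2).sum().backward()
print(w.grad)   # tensor([4.])

# autograd.grad is unaffected by the stale .grad: it returns
# the gradient of this call only.
(g,) = torch.autograd.grad((w * 2).sum(), w)
print(g)        # tensor([2.])
```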
Also, what do you mean by “slightly different” exactly?
I figured that the difference was having create_graph set to True while using .backward(), and set to False while using .grad().
My understanding of create_graph is that it enables differentiation through the gradient computation step. I don’t understand why this should affect the values of the computed gradients; any help clarifying this would be much appreciated. Thank you.
Again, it depends how big the difference is. If the difference is at the level of numerical precision (or one or two orders of magnitude larger if you test a full network), then this is because create_graph forces us to use a backward implementation that is differentiable.
And because floating point arithmetic is not associative, using different implementations can lead to bit-wise differences in the result. These differences are then amplified by the later computations in the backward pass.
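For context, what create_graph=True buys you is the ability to differentiate through the backward pass itself (double backward). A small sketch with a scalar toy function:

```python
import torch

# create_graph=True records the backward pass in the autograd graph,
# so the resulting gradient can itself be differentiated.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 3                                  # y = x^3

(g,) = torch.autograd.grad(y, x, create_graph=True)
print(g)    # dy/dx = 3*x^2 = 27

# Second derivative, only possible because g carries a graph.
(g2,) = torch.autograd.grad(g, x)
print(g2)   # d2y/dx2 = 6*x = 18
```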
@albanD Thank you for helping me out. Would you mind if I share a GitHub link with a minimal example?
I am using a model where the first layer is the BERT embedding layer, attached to a classifier. The gradients for the classifier, however, match with and without create_graph.
It’s easy to run (python test.py) and doesn’t call any weird libraries, haha!
The pre-trained model is also uploaded in the repo, and it’s essential for seeing the discrepancy: for randomly initialized models we don’t see it.
The whole thing takes a minute to run! All the relevant action takes place between lines 146 and 170; the rest is just setting up the BERT model.
Thanks a lot for agreeing to take a look at the code.
After doing the change to run on CPU (I don’t have a GPU on my machine), I can’t seem to reproduce this:
when create_graph is True
loss :18.81757354736328
/data/users/albandes/pytorch/3.6_debug_source/torch/functional.py:1242: UserWarning: torch.norm is deprecated and may be removed in a future PyTorch release. Use torch.linalg.norm instead.
"torch.norm is deprecated and may be removed in a future PyTorch release. "
classifier grad norm :394.94049072265625
bert embedding grad norm :20.38188934326172
When create_graph is False
loss :18.81757354736328
classifier grad norm :394.94049072265625
bert embedding grad norm :20.38188934326172
So my guess is that some of the CUDA libraries behave differently between the two paths and lead to different numerical values.
Can you try to:
disable cudnn: torch.backends.cudnn.enabled=False
increase the precision to double: input = input.double() and net.double()
test on cpu (remove the .cuda() and add a map_location=torch.device('cpu') to the model loading)
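The three isolation steps above can be sketched roughly as follows (names like net, input, and the checkpoint path are placeholders, not from the actual test.py):

```python
import torch

torch.manual_seed(0)

# 1) Rule out cudnn's algorithm selection as a source of differences.
torch.backends.cudnn.enabled = False

# 2) Increase precision to double so implementation differences
#    shrink toward float64 round-off.
net = torch.nn.Linear(4, 2).double()
input = torch.randn(1, 4, dtype=torch.float64)

# 3) Stay on CPU; when loading a checkpoint that was saved on GPU,
#    remap it, e.g.:
# state = torch.load("model.pt", map_location=torch.device("cpu"))

net(input).sum().backward()
print(net.weight.grad.norm())
```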
But I ran the code on CPU and made the changes you asked me to make, yet I still got different results depending on create_graph.
I also fixed the random seed with torch.manual_seed(0).
when create_graph is True
loss :15.955138471719597
classifier grad norm :394.79707122347384
bert embedding grad norm :23.128852966924228
When create_graph is False
loss :15.955138471719597
classifier grad norm :394.79707122347384
bert embedding grad norm :19.880596727123276
I have pushed a CPU-runnable version of my code to the git repo; could you run that, please?
Could the difference be due to a different version of PyTorch?
The torch-related libraries I am using: torch==1.6.0, torchfile==0.1.0, torchvision==0.7.0.
I can’t reproduce on master or with the nightly builds:
[albandes ~/tmp/create_graph_debug] git pull
Updating 15a5a85..169926d
Fast-forward
test.py | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
[albandes@devvm138.atn0 ~/tmp/create_graph_debug] . ../../local/pytorch/3.6_debug_source_env/bin/activate
(3.6_debug_source_env) [albandes ~/tmp/create_graph_debug] python test.py
when create_graph is True
loss :15.955138471719597
/data/users/albandes/pytorch/3.6_debug_source/torch/functional.py:1242: UserWarning: torch.norm is deprecated and may be removed in a future PyTorch release. Use torch.linalg.norm instead.
"torch.norm is deprecated and may be removed in a future PyTorch release. "
classifier grad norm :394.79707122347384
bert embedding grad norm :19.880596727129774
When create_graph is False
loss :15.955138471719597
classifier grad norm :394.79707122347384
bert embedding grad norm :19.880596727129774
But I can on 1.6:
(3.6_release_binary_env) [albandes@devvm138.atn0 ~/tmp/create_graph_debug] python test.py
when create_graph is True
loss :15.955138471719597
classifier grad norm :394.7970712234737
bert embedding grad norm :23.12885296693177
When create_graph is False
loss :15.955138471719597
classifier grad norm :394.7970712234737
bert embedding grad norm :19.88059672712989