In my training loop, I forward idx1 and idx2 to get output, which is a Variable.
As shown, tmp and tmp1 are intermediate Variables whose gradients are not needed, since I could eliminate them by substituting their definitions directly:
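For context, here is a hypothetical reconstruction of the setup being described (the actual model wasn't posted, so the embedding sizes and the way tmp/tmp1 are combined are assumptions; written with the current API where Variable is merged into Tensor):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real model: two embedding lookups
# whose intermediate results tmp and tmp1 are combined into output.
emb = nn.Embedding(10, 4)

idx1 = torch.tensor([0, 1])
idx2 = torch.tensor([2, 3])

embeds1 = emb(idx1)
embeds2 = emb(idx2)

tmp = embeds1.sum(dim=0)    # intermediate Variable
tmp1 = embeds2.sum(dim=0)   # intermediate Variable
output = tmp + tmp1
```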
This is not what detach() is for.
You can think of detach() as a breakpoint in the graph: no gradient will flow back past that point.
In your case, if you detach tmp and tmp1, no gradients will be propagated to embeds1 and embeds2.
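A minimal sketch of that cut-off behavior (using plain tensors, since Variable is now merged into Tensor):

```python
import torch

x = torch.ones(3, requires_grad=True)

# Without detach: gradients flow all the way back to x
y = x * 2
loss = (y * 3).sum()
loss.backward()
print(x.grad)               # tensor([6., 6., 6.])

# With detach: the graph is cut at y_cut, so anything upstream
# of the detach point (here, x) would receive no gradient from it
y_cut = (x * 2).detach()
print(y_cut.requires_grad)  # False
```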
In PyTorch, gradients are only accumulated for the Variables that are created by the user with requires_grad=True. Temporary Variables still pass gradients through during backward, but no gradient is stored for them by default.
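You can check this directly: the user-created tensor gets a .grad after backward, while the temporary one does not (accessing .grad on a non-leaf may also print a warning in recent PyTorch versions):

```python
import torch

w = torch.ones(3, requires_grad=True)  # user-created: gradient will be stored
tmp = w * 2                            # temporary: used during backward,
                                       # but no gradient is stored for it
loss = tmp.sum()
loss.backward()
print(w.grad)    # tensor([2., 2., 2.])
print(tmp.grad)  # None
```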
When using nn, keep in mind that creating an nn.Parameter is the same as creating a Variable with requires_grad=True.
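For example, every parameter of an nn module is a user-created (leaf) tensor with requires_grad=True out of the box:

```python
import torch.nn as nn

lin = nn.Linear(4, 2)
# nn.Parameter behaves like a user-created tensor with requires_grad=True
for name, p in lin.named_parameters():
    print(name, p.requires_grad, p.is_leaf)  # True True for weight and bias
```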
But when I look at tmp.requires_grad in the forward() function, it returns True. So you are saying that even though it says True, gradients are not computed. Am I right?
If a Variable has var.requires_grad == True, that means it was computed from at least one Variable created by the user with requires_grad=True (those user-created Variables are called leaf Variables).
And thus, to compute the gradients of those leaf Variables, some gradients need to flow back through this intermediate Variable; that is what its requires_grad=True indicates.
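Putting the two flags together, a small sketch of how requires_grad propagates and how it relates to leaf tensors:

```python
import torch

a = torch.ones(2, requires_grad=True)  # leaf Variable created by the user
b = torch.ones(2)                      # also a leaf, but requires_grad=False
c = a + b                              # computed from a, so the flag propagates

print(a.is_leaf, b.is_leaf, c.is_leaf)  # True True False
print(c.requires_grad)                  # True: gradients must flow back
                                        # through c to reach the leaf a
```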