In the second snippet, a is a leaf Variable, so gradients are correctly populated.
From PyTorch version 0.1.10 onwards, the gradients of non-leaf Variables are actually None, so hopefully that clarifies things and is the better behavior you would want.
In the first example, a is the result of calling the cuda() method on a user-created Variable, i.e. Variable(torch.randn(2,10), requires_grad=True) is a leaf, but Variable(torch.randn(2,10), requires_grad=True).cuda() is a different Variable and is not a leaf.
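To make the leaf / non-leaf distinction concrete, here is a minimal sketch (written against the old Variable API from that era, and assuming a CUDA-capable machine):

```python
import torch
from torch.autograd import Variable

# Leaf Variable: created directly by the user, so its .grad is populated.
a = Variable(torch.randn(2, 10), requires_grad=True)

# Non-leaf Variable: .cuda() adds a new node whose parent is the (unnamed)
# CPU leaf, so after backward() its own .grad stays None.
b = Variable(torch.randn(2, 10), requires_grad=True).cuda()

a.sum().backward()
b.sum().backward()

print(a.grad)  # 2x10 tensor of ones
print(b.grad)  # None -- b is not a leaf
```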
So, am I getting it right that .cuda() produces a new node on the computational graph? That seems counter-intuitive. Why would you do that? .cuda() is certainly not differentiable.
If you take a look at the source, it's effectively acting as an identity function. I can't speak to the design reasons, although it seems like the primary purpose is to allow GPU gradients to propagate to CPU Variables and vice versa.
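For example, something along these lines (again assuming the Variable API and an available GPU) shows the gradient flowing back across the device boundary to the CPU leaf:

```python
import torch
from torch.autograd import Variable

# Keep a handle on the CPU leaf so we can inspect its gradient afterwards.
x_cpu = Variable(torch.randn(2, 10), requires_grad=True)

# In the graph, .cuda() behaves like an identity op: the forward pass copies
# the data to the GPU, the backward pass copies the gradient back to the CPU leaf.
x_gpu = x_cpu.cuda()

(x_gpu * 2).sum().backward()

print(x_gpu.grad)  # None -- x_gpu is not a leaf
print(x_cpu.grad)  # 2x10 CPU tensor filled with 2s
```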
I would say this design leaves me confused about when to use .cuda(), since I would expect calling .cuda() on either the tensor or the Variable to work equally well. (Though it does not.)
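For what it's worth, the pattern that does behave the way I expected is to call .cuda() on the tensor before wrapping it, so that the Variable itself is the leaf, e.g. (a rough sketch, assuming a GPU is available):

```python
import torch
from torch.autograd import Variable

# Move the raw tensor to the GPU first, then wrap it: the resulting
# Variable is a leaf that lives on the GPU, so .grad is populated on it.
a = Variable(torch.randn(2, 10).cuda(), requires_grad=True)

a.sum().backward()
print(a.grad)  # 2x10 CUDA tensor of ones
```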