Strange behavior of Variable.cuda() and Variable.grad

Here is the code where we call .cuda() on the Variable:

import torch
from torch.autograd import Variable

a = Variable(torch.randn(2,10), requires_grad=True).cuda()
y = a + 10.0
y.backward(torch.ones(a.size()).cuda())
print(a.grad)

The output is:

Variable containing:
0     0     0     0     0     0     0     0     0     0
0     0     0     0     0     0     0     0     0     0
[torch.cuda.FloatTensor of size 2x10 (GPU 0)]

which is incorrect; the gradient should be all ones.
If we modify it and call .cuda() on the tensor:

a = Variable(torch.randn(2,10).cuda(), requires_grad=True)
y = a + 10.0
y.backward(torch.ones(a.size()).cuda())
print(a.grad)

The output is correct:

Variable containing:
1     1     1     1     1     1     1     1     1     1
1     1     1     1     1     1     1     1     1     1
[torch.cuda.FloatTensor of size 2x10 (GPU 0)]

I think this rather strange behavior needs some clarification, or an exception should be thrown.
My PyTorch version is 0.1.9+49295eb.


The reason you see this is that a in the first code snippet is a non-leaf Variable (i.e. it is not user-created, but is the result of an operation).

We do not store gradients of non-leaf Variables; they have to be accessed via hooks.
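For example, something along these lines lets you read the gradient of the non-leaf a via a hook. This is only a minimal sketch using the register_hook(hook) signature from later PyTorch releases; the exact hook API in 0.1.x may differ slightly:

import torch
from torch.autograd import Variable

grads = {}

a = Variable(torch.randn(2,10), requires_grad=True).cuda()  # non-leaf
a.register_hook(lambda g: grads.update(a=g))                # hook receives a's gradient during backward

y = a + 10.0
y.backward(torch.ones(a.size()).cuda())
print(grads['a'])  # all ones, same shape and device as a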

In the second snippet, a is a leaf Variable, so gradients are correctly populated.

From PyTorch version 0.1.10 onwards, the .grad of a non-leaf Variable is actually None, so hopefully that is the clarification / better behavior you are looking for:

>>> print(a.grad)
None

Can you elaborate on why a is a leaf Variable in the second case but not in the first? I don't see it.

In the first example, a is the result of the .cuda() method call on a user-created Variable. That is, Variable(torch.randn(2,10), requires_grad=True) is a leaf, but Variable(torch.randn(2,10), requires_grad=True).cuda() is a different Variable, and is not a leaf.
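A small sketch to make that concrete, assuming a later PyTorch version that exposes grad_fn on Variables (0.1.x used a creator attribute instead):

import torch
from torch.autograd import Variable

leaf = Variable(torch.randn(2,10), requires_grad=True)
moved = leaf.cuda()                # produced by an operation on leaf

print(leaf.grad_fn is None)        # True  -> user-created, a leaf
print(moved.grad_fn is None)       # False -> .cuda() added a new node to the graph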

So, am I getting it right that .cuda() produces a new node on the computational graph? That seems counter-intuitive. Why would you do that? .cuda() is certainly not differentiable.


If you take a look at the source, it effectively acts as an identity function. I can't speak to the design reasons, but it seems the primary purpose is to allow GPU gradients to propagate back to CPU Variables and vice versa.
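For example, here is a quick sketch of that behavior as I understand it: keep the leaf on the CPU, move it to the GPU with .cuda(), and the gradient computed on the GPU flows back through the .cuda() node into the CPU leaf:

import torch
from torch.autograd import Variable

a_cpu = Variable(torch.randn(2,10), requires_grad=True)  # CPU leaf
a_gpu = a_cpu.cuda()                                     # non-leaf copy on the GPU

y = a_gpu + 10.0
y.backward(torch.ones(a_gpu.size()).cuda())

print(a_cpu.grad)  # all ones, stored on the CPU leaf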


I would say this design makes it confusing to know when to call .cuda(), since I would expect calling it on either the tensor or the Variable to work equally well. (Though it does not.)