In [1]: import torch
In [2]: a=torch.tensor([2], requires_grad=True)
In [3]: b=a.to('cuda')
In [4]: a.requires_grad
Out[4]: True
In [5]: b.requires_grad
Out[5]: False

Why does b NOT keep the requires_grad info from a?

Besides, .to() does not seem to be an in-place operation, which would make it different from .cuda(). This confuses me, since the migration guide tells us to use .to() in place of .cuda().

In [1]: import torch
In [2]: a = torch.rand(10, requires_grad=True)
In [3]: a.requires_grad
Out[3]: True
In [4]: b = a.to("cuda")
In [5]: b.requires_grad
Out[5]: True

I’m looking into it. It doesn’t look like this is limited to scalars (I think you mean 1-element tensors, since the tensor above isn’t 0-dim), because I get the correct behavior if I pass 1 to torch.rand instead of 10.

Oh, I see the issue. The tensor you created is not floating point; if you create a floating-point tensor with torch.tensor([2.], requires_grad=True), it works as expected. We recently merged some code that makes non-floating-point tensor calculations not require grad (I don’t know whether that changed this specific code path; I have to check). I agree the result here is unintuitive, and we’ll improve it.
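To make the dtype point concrete, here is a minimal sketch. It assumes the default float dtype is unchanged and falls back to the CPU when no GPU is present (the `device` variable is my own addition, not from the posts above). Behavior for the integer case depends on the PyTorch version: recent releases, as far as I know, raise a RuntimeError when requires_grad=True is requested for an integer tensor, so the sketch guards that call instead of relying on one outcome.

```python
import torch

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

int_tensor = torch.tensor([2])                          # integer literal -> integer dtype
float_tensor = torch.tensor([2.0], requires_grad=True)  # float literal -> float dtype
print(int_tensor.dtype)
print(float_tensor.dtype)

# A floating-point tensor keeps requires_grad across .to(device).
moved = float_tensor.to(device)
print(moved.requires_grad)

# Whether an integer tensor may be created with requires_grad=True is
# version-dependent, so record the outcome rather than assume it.
try:
    torch.tensor([2], requires_grad=True)
    int_requires_grad_allowed = True
except RuntimeError:
    int_requires_grad_allowed = False
print(int_requires_grad_allowed)
```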

The example you just showed works as expected.
Here x does not contain the leaf Variable for which gradients are computed, but the result of the operation .to(device).

@gchanan yes, I think it’s because torch.tensor([2]) creates an integer-typed tensor. Shouldn’t the constructor just fail if we ask for requires_grad=True?

Here x does not contain the leaf Variable for which gradients are computed, but the result of the operation .to(device).

I am a bit of a newbie. Could you elaborate on this a bit more? Isn’t the result of the .to(device) operation the same leaf variable, just on the GPU? Thanks.

No, it’s a new variable that lives on the GPU and contains the same values as the original CPU tensor.
Does the following sample make it clearer?

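import torch

# First case: the CPU tensor a is the leaf; b is the result of the .cuda() operation.
a = torch.rand(1, requires_grad=True)
b = a.cuda()
c = 3*b
c.sum().backward()
print(c.grad) # None: not a leaf
print(b.grad) # None: not a leaf
print(a.grad) # tensor([3.]): a leaf

# Second case: the CPU leaf is never bound to a name, so its gradient cannot be read.
b = torch.rand(1, requires_grad=True).cuda()
c = 3*b
c.sum().backward()
print(c.grad) # None: not a leaf
print(b.grad) # None: not a leaf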

@albanD, Thanks.
Could you explain the second question for me?

Besides, .to() does not seem to be an in-place operation, which would make it different from .cuda(). This confuses me, since the migration guide tells us to use .to() in place of .cuda().
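On the in-place question, a small sketch may help. As far as I understand it, for tensors both .to() and .cuda() return a result and leave the original tensor untouched, so they do not differ in that respect. The in-place behavior belongs to nn.Module, where .to() moves the parameters and returns the module itself. The `device` fallback and the nn.Linear layer are only illustrative choices of mine.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tensor: .to() returns a tensor on the target device; t itself stays where it was.
t = torch.rand(3)
t_moved = t.to(device)
print(t.device)        # still the CPU
print(t_moved.device)  # the target device

# Module: .to() moves the parameters in place and returns the same module object.
model = nn.Linear(3, 1)
returned = model.to(device)
print(returned is model)
print(next(model.parameters()).device)
```

So for a tensor you need to keep the return value (t = t.to(device)), while for a module a bare model.to(device) is enough.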

How, then, should we make the output of tensor.to() a leaf of the graph? I understand why it behaves this way, but I’m not sure how to implement the case where GPU tensors are leaves of the graph so that gradients are computed for them. Thanks!
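A few ways to end up with a leaf on the target device, as a hedged sketch rather than the one recommended approach. The names w1, w2, w3 and the `device` fallback are hypothetical. Note that the third option cuts the link to the original CPU tensor, so gradients no longer flow back to it.

```python
import torch

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Option 1: create the tensor directly on the device with requires_grad=True.
w1 = torch.rand(1, device=device, requires_grad=True)

# Option 2: move a tensor that has no autograd history, then mark it in place.
w2 = torch.rand(1).to(device).requires_grad_()

# Option 3: for a tensor that already requires grad, detach the moved copy
# and mark it again. The copy becomes a new leaf, independent of cpu_leaf.
cpu_leaf = torch.rand(1, requires_grad=True)
w3 = cpu_leaf.to(device).detach().requires_grad_()

print(w1.is_leaf, w2.is_leaf, w3.is_leaf)

# Gradients now accumulate on the device-side leaf.
loss = (3 * w1).sum()
loss.backward()
print(w1.grad)
```

If you only need to inspect the gradient of a non-leaf tensor (for example b = a.to(device)), calling b.retain_grad() before backward() is another possibility.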