Tensor.to() does NOT retain requires_grad info?

Hi, I found a weird bug:

In [1]: import torch

In [2]: a=torch.tensor([2], requires_grad=True)

In [3]: b=a.to('cuda')

In [4]: a.requires_grad
Out[4]: True

In [5]: b.requires_grad
Out[5]: False

Why does b not keep the requires_grad info from a?

Besides, .to() is seemingly not an in-place operation, which means it's different from .cuda(). So I feel confused, since the migration guide told us to use .to() to replace .cuda()?


Hi,
I think the issue is just when it’s a scalar:

In [1]: import torch

In [2]: a = torch.rand(10, requires_grad=True)

In [3]: a.requires_grad
Out[3]: True

In [4]: b = a.to("cuda")

In [5]: b.requires_grad
Out[5]: True

@smth is that expected behaviour?

I’m looking into it; it doesn’t look like it’s limited to scalars (I think you mean 1-element tensors, since the example above isn’t 0-dim), because I get the correct behavior if I pass 1 to torch.rand instead of 10.

Oh, I see the issue. The tensor you created is not floating point; if you create a floating point tensor with torch.tensor([2.], requires_grad=True), it works as expected. We recently merged some code that makes non-floating-point tensor calculations not require grad (I don’t know if that changed this specific code path or not; I have to check). I agree the result here is unintuitive and we’ll improve it.
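For illustration, a minimal sketch of the dtype difference (assuming a CUDA build; on versions that include the change mentioned below, the integer construction itself raises an error instead of silently dropping the flag):

import torch

a_int = torch.tensor([2], requires_grad=True)   # integer dtype (torch.int64)
a_flt = torch.tensor([2.], requires_grad=True)  # floating point (torch.float32)

print(a_int.to("cuda").requires_grad)  # False -- flag dropped for the integer tensor
print(a_flt.to("cuda").requires_grad)  # True  -- flag retained for the float tensor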


The issue also occurs for float tensors, despite the fact that the requires_grad flag is retained. Proof:

import torch

device = torch.device("cuda")
x = torch.randn(5, 3, requires_grad=True).to(device)
x.sum().backward()
print(x.grad, x.requires_grad)  # (None, True)

The example you just showed works as expected.
Here x is not the leaf Variable for which gradients are computed, but the result of the operation .to(device).
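For example, keeping a separate reference to the CPU leaf makes this visible (a minimal sketch, assuming a CUDA device is available):

import torch

device = torch.device("cuda")

x_cpu = torch.randn(5, 3, requires_grad=True)  # this is the leaf
x = x_cpu.to(device)                           # result of an operation, not a leaf
x.sum().backward()
print(x.grad)      # None -- x is not a leaf
print(x_cpu.grad)  # a (5, 3) tensor of ones -- gradients accumulate on the CPU leaf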

@gchanan yes I think it’s because torch.tensor([2]) creates an integer typed tensor. Shouldn’t the constructor just fail if we ask for requires_grad=True?

@albanD I made it throw an error in https://github.com/pytorch/pytorch/pull/7185, although I think it should probably just be a warning (and not set requires_grad to True).
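With that change in place, constructing an integer tensor with requires_grad=True fails; roughly like this (a sketch -- the exact error type and message may differ by version):

import torch

try:
    torch.tensor([2], requires_grad=True)  # integer dtype
except RuntimeError as e:
    print(e)  # e.g. only floating point dtypes can require gradients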

Here x is not the leaf Variable for which gradients are computed, but the result of the operation .to(device).

I am a bit of a newbie. Could you elaborate on this a bit more? Isn’t the result of the .to(device) operation the same leaf variable, just on the GPU? Thanks.

No, it’s a new variable that is on the GPU and contains the same values as the original CPU tensor.
Does the following sample make it clearer?

import torch

a = torch.rand(1, requires_grad=True)
b = a.cuda()
c = 3*b

c.sum().backward()
print(c.grad) # None:        Not a leaf
print(b.grad) # None:        Not a leaf
print(a.grad) # tensor([3.]): A leaf


b = torch.rand(1, requires_grad=True).cuda()
c = 3*b

c.sum().backward()
print(c.grad) # None:        Not a leaf
print(b.grad) # None:        Not a leaf

Yes. Thank you very much

@albanD, Thanks.
Could you explain the second question for me?

Besides, .to() is seemingly not an in-place operation, which means it's different from .cuda(). So I feel confused, since the migration guide told us to use .to() to replace .cuda()?

In PyTorch 0.3, we could write

a = torch.Tensor(...)
a.cuda()

but in PyTorch 0.4, we need

a = torch.Tensor(...)
a = a.to(device)

So .to() is not an in-place operation?

a.cuda() is not in-place either.
If you do that in 0.3, a is still on the CPU; only the returned value is on the GPU :wink:
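For example (assuming a CUDA device is available):

import torch

a = torch.rand(2)
a.cuda()           # the returned copy is discarded
print(a.is_cuda)   # False -- a itself is still on the CPU
b = a.cuda()
print(b.is_cuda)   # True -- only the returned tensor lives on the GPU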

How then should we make the output of tensor.to() a leaf of the graph? I get why it behaves this way, but I'm not sure how to set things up so that GPU tensors are on the graph and get gradients. Thanks!

Hi,

This post should answer your question.
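In short, a minimal sketch of two common ways to get a leaf tensor on the GPU (assuming a CUDA device is available):

import torch

# Option 1: create the leaf directly on the GPU
w1 = torch.randn(5, 3, requires_grad=True, device="cuda")

# Option 2: move first, then flag the moved tensor as requiring grad;
# since it was created without requires_grad, the moved copy is itself a leaf
w2 = torch.randn(5, 3).cuda().requires_grad_()

(w1.sum() + w2.sum()).backward()
print(w1.grad is not None, w2.grad is not None)  # True True -- both are GPU leaves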
