Tensor.to() does NOT retain requires_grad info?

Hi, I found a weird bug:

In [1]: import torch

In [2]: a=torch.tensor([2], requires_grad=True)

In [3]: b=a.to('cuda')

In [4]: a.requires_grad
Out[4]: True

In [5]: b.requires_grad
Out[5]: False

Why does b not keep the requires_grad info from a?

Besides, .to() is seemingly not an in-place operation, which means it's different from .cuda(). So I feel confused, since the migration guide told us to use .to() to replace .cuda()?


Hi,
I think the issue is just when it’s a scalar:

In [1]: import torch

In [2]: a = torch.rand(10, requires_grad=True)

In [3]: a.requires_grad
Out[3]: True

In [4]: b = a.to("cuda")

In [5]: b.requires_grad
Out[5]: True

@smth is that expected behaviour?

I’m looking into it; it doesn’t look like it’s limited to scalars (I think you mean 1-element tensors, since the example above isn’t 0-dim), because I get the correct behavior if I pass 1 to torch.rand instead of 10.

Oh, I see the issue. The tensor you created is not floating point; if you create a floating point tensor with torch.tensor([2.], requires_grad=True), it works as expected. We recently merged some code that makes non-floating-point tensor calculations not require grad (I don’t know if that changed this specific code path or not; I have to check). I agree the result here is unintuitive and we’ll improve it.
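For illustration, a minimal sketch of the dtype difference (assuming a CUDA build; on versions that include the change mentioned below, the integer construction itself raises an error instead of silently dropping the flag):

import torch

a_int = torch.tensor([2], requires_grad=True)   # integer dtype (torch.int64)
a_flt = torch.tensor([2.], requires_grad=True)  # floating point (torch.float32)

print(a_int.to("cuda").requires_grad)  # False -- flag dropped for the integer tensor
print(a_flt.to("cuda").requires_grad)  # True  -- flag retained for the float tensor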


The issue also occurs for float tensors, despite the fact that the requires_grad flag is retained. Proof:

import torch

device = torch.device("cuda")
x = torch.randn(5, 3, requires_grad=True).to(device)
x.sum().backward()
print(x.grad, x.requires_grad)  # (None, True)

The example you just showed works as expected.
Here x is not the leaf Variable for which gradients are computed, but the result of the operation .to(device).
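For example, keeping a separate reference to the CPU leaf makes this visible (a minimal sketch, assuming a CUDA device is available):

import torch

device = torch.device("cuda")

x_cpu = torch.randn(5, 3, requires_grad=True)  # this is the leaf
x = x_cpu.to(device)                           # result of an operation, not a leaf
x.sum().backward()
print(x.grad)      # None -- x is not a leaf
print(x_cpu.grad)  # a (5, 3) tensor of ones -- gradients accumulate on the CPU leaf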

@gchanan yes I think it’s because torch.tensor([2]) creates an integer typed tensor. Shouldn’t the constructor just fail if we ask for requires_grad=True?

@albanD I made it throw an error in https://github.com/pytorch/pytorch/pull/7185, although I think it should probably just be a warning (and not set requires_grad to True).
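With that change in place, constructing an integer tensor with requires_grad=True fails; roughly like this (a sketch -- the exact error type and message may differ by version):

import torch

try:
    torch.tensor([2], requires_grad=True)  # integer dtype
except RuntimeError as e:
    print(e)  # e.g. only floating point dtypes can require gradients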

Here x is not the leaf Variable for which gradients are computed, but the result of the operation .to(device).

I am a bit of a newbie. Could you elaborate on this a bit more? Isn’t the result of the .to(device) operation the same leaf variable, just on the GPU? Thanks.

No, it’s a new variable that is on the GPU and contains the same values as the original CPU tensor.
Does the following sample make it clearer?

import torch

a = torch.rand(1, requires_grad=True)
b = a.cuda()
c = 3*b

c.sum().backward()
print(c.grad) # None:        Not a leaf
print(b.grad) # None:        Not a leaf
print(a.grad) # tensor([3.]): A leaf


b = torch.rand(1, requires_grad=True).cuda()
c = 3*b

c.sum().backward()
print(c.grad) # None:        Not a leaf
print(b.grad) # None:        Not a leaf

Yes. Thank you very much

@albanD, Thanks.
Could you explain the second question for me?

Besides, .to() is seemingly not an in-place operation, which means it's different from .cuda(). So I feel confused, since the migration guide told us to use .to() to replace .cuda()?

In PyTorch 0.3, we could write

a = torch.Tensor(...)
a.cuda()

but in PyTorch 0.4, we need

a = torch.Tensor(...)
a = a.to(device)

So .to() is not an in-place operation?

a.cuda() is not in-place either.
If you do that in 0.3, a is still on the CPU; only the returned value is on the GPU :wink:
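For example (assuming a CUDA device is available):

import torch

a = torch.rand(2)
a.cuda()           # the returned copy is discarded
print(a.is_cuda)   # False -- a itself is still on the CPU
b = a.cuda()
print(b.is_cuda)   # True -- only the returned tensor lives on the GPU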

How then should we make the output of tensor.to() a leaf of the graph? I get why it behaves this way, but I'm not sure how to set things up so that GPU tensors are on the graph and get gradients. Thanks!

Hi,

This post should answer your question.
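In short, a minimal sketch of two common ways to get a leaf tensor on the GPU (assuming a CUDA device is available):

import torch

# Option 1: create the leaf directly on the GPU
w1 = torch.randn(5, 3, requires_grad=True, device="cuda")

# Option 2: move first, then flag the moved tensor as requiring grad;
# since it was created without requires_grad, the moved copy is itself a leaf
w2 = torch.randn(5, 3).cuda().requires_grad_()

(w1.sum() + w2.sum()).backward()
print(w1.grad is not None, w2.grad is not None)  # True True -- both are GPU leaves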
