Net.cuda() moves all parameters to the GPU, so how is autograd ensured?

If I run this simple code:

import torch

w = torch.tensor([1.0], requires_grad=True)
w = w.cuda()        # w is now a GPU copy, no longer a leaf
l = (w * w).sum()   # some computation (just an example)
l.backward()
print(w.grad)       # still None

Then the tensor w is no longer a leaf node in autograd's graph (its grad_fn is CopyBackwards).
The parameters in a net are all initialized with requires_grad=True.
So if I call net.cuda(), is that equivalent to doing param = param.cuda() for every parameter? Would it run into the same problem as above?
Looking forward to a reply!

Hi,

When you do net.cuda(), you should do it before you give the parameters to the optimizer.
And the conversion to cuda is actually not done in a differentiable manner, so the Parameters remain leaves (but now on a different device).
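
For illustration, a minimal sketch of that ordering (the nn.Linear here is just a stand-in for an arbitrary net):

import torch
import torch.nn as nn

net = nn.Linear(3, 1)                     # stand-in for any model
net.cuda()                                # move parameters in place, before creating the optimizer
opt = torch.optim.SGD(net.parameters(), lr=0.1)

for p in net.parameters():
    print(p.is_leaf, p.device)            # True cuda:0 -- still leaves, now on the GPU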

Thanks for your reply!~
My understanding is this:
If I do tensor.cuda(), it returns a copy on the GPU and this operation is differentiable (CopyBackwards).
But if I do net.cuda(), the return value is self, so the operation is not differentiable (not a copy, but a move?).
Is that right? So there is a difference between tensor.cuda() and net.cuda()?

Yes the two are very different.

tensor.cuda() is out of place: tensor itself is not modified.
The operation on Tensor (like all ops on Tensors) is differentiable.

net.cuda() is in place! It changes the nn.Module, and the returned value is the same object as net.
The operation is not differentiable, you can see it as doing:

# In pseudo code (this won't work for nested nn.Modules)
with torch.no_grad():
    for name, param in net.named_parameters():
        setattr(net, name, param.cuda())
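
To make the contrast concrete, here is a small check you could run (nn.Linear is just a placeholder):

import torch
import torch.nn as nn

t = torch.tensor([1.0], requires_grad=True)
t2 = t.cuda()
print(t2.is_leaf, t2.grad_fn)                     # False, <CopyBackwards ...>: a tracked copy

net = nn.Linear(2, 2)                             # placeholder module
ret = net.cuda()
print(ret is net)                                 # True: the module itself was modified
print(all(p.is_leaf for p in net.parameters()))   # True: parameters stay leaves on the GPU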

Thanks~ I did some experiments:
For a tensor, the code looks like this:

import torch

b = torch.tensor([1.0, 2.0], requires_grad=True)

with torch.no_grad():
    b = b.cuda()            # b.requires_grad will be False here
    b.requires_grad = True  # re-mark the new GPU tensor as a leaf that requires grad

y = (b * b).sum()
y.backward()
print(b.grad)               # this works

And for a net, I can do this:

for name, param in net.named_parameters():
    with torch.no_grad():  # I find that dropping this line does not affect the result, why...
        exec(f'net.{name} = torch.nn.Parameter(net.{name}.cuda())')  # this works

The above code works as expected. For the net case, what is the effect of the with torch.no_grad()? I find that it still works as expected even without that line.
Thanks a lot!

It has no effect because an nn.Parameter() is always a leaf (so that its .grad field gets populated when you call .backward()). Wrapping the result in a Parameter therefore already breaks the differentiability of the .cuda() copy, which is why the no_grad() makes no difference.
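
A quick way to see this (assuming CUDA is available):

import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)

with torch.no_grad():
    p1 = torch.nn.Parameter(t.cuda())   # wrapped inside no_grad
p2 = torch.nn.Parameter(t.cuda())       # wrapped without no_grad

print(p1.is_leaf, p1.requires_grad)     # True True
print(p2.is_leaf, p2.requires_grad)     # True True -- Parameter detaches either way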