What is the difference between doing `net.cuda()` vs `net.to(device)`?

pinocchio · February 10, 2020, 10:07pm

I was going through this post ([SOLVED] Make Sure That Pytorch Using GPU To Compute) and I had the question, what is the difference between these two pieces of code?

from collections import OrderedDict
import torch.nn as nn

net = nn.Sequential(OrderedDict([('fc1', nn.Linear(3, 1))]))
net.cuda()

vs

from collections import OrderedDict
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

net = nn.Sequential( OrderedDict([ ('fc1', nn.Linear(3,1)) ]) )
net.to(device)

vs

from collections import OrderedDict
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

net = nn.Sequential( OrderedDict([ ('fc1', nn.Linear(3,1)) ]) )
net = net.to(device)

Which one is recommended? Which one is hardware-agnostic (i.e. works regardless of the type of GPU, or even on a CPU-only machine)?

Is there some sort of internal flag I can check to see whether things are properly placed on the GPU?

nairbv · February 10, 2020, 10:45pm

cuda() and to('cuda') are going to do the same thing, but the latter is more flexible. As you can see in your example code, you can specify a device that might be 'cpu' if CUDA is unavailable.

If you attempt to call cuda() on a system that doesn’t have a GPU (more precisely, with a PyTorch build that lacks CUDA support), you’ll get:
AssertionError: Torch not compiled with CUDA enabled.

With the explicit call, you can also use multiple cuda devices – e.g. to('cuda:0') is different from to('cuda:1'). The simpler cuda() call will just use the default cuda device.
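
A minimal sketch of the hardware-agnostic pattern described above, which also shows one way to check where a model ended up. It assumes that all parameters sit on the same device, so looking at the first one is representative; the helper name `module_device` is made up for this illustration and is not part of PyTorch.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

# Fall back to the CPU when no CUDA device is usable.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

net = nn.Sequential(OrderedDict([("fc1", nn.Linear(3, 1))]))
net = net.to(device)


def module_device(module: nn.Module) -> torch.device:
    """Hypothetical helper: report the device of the first parameter."""
    return next(module.parameters()).device


print(module_device(net))        # same device type as `device`
print(net.fc1.weight.is_cuda)    # True only when the model is on a GPU

# Inputs must live on the same device as the model.
x = torch.randn(4, 3, device=device)
out = net(x)
print(out.shape)                 # torch.Size([4, 1])
```

The same script runs unchanged on a CPU-only machine, which is the main reason to prefer to(device) over a hard-coded cuda() call.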

pinocchio · February 10, 2020, 10:46pm

what about:

from collections import OrderedDict
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

net = nn.Sequential( OrderedDict([ ('fc1', nn.Linear(3,1)) ]) )
net = net.to(device)

vs

from collections import OrderedDict
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

net = nn.Sequential( OrderedDict([ ('fc1', nn.Linear(3,1)) ]) )
net.to(device)

?

What’s the difference? Which one should I choose?

nairbv · February 10, 2020, 10:53pm

Your second variant is copying the net to the device, but not assigning the copy to anything. to(device) is not an in-place operation, so this effectively doesn’t do anything.

pinocchio · February 10, 2020, 10:55pm

Not sure that’s correct. No errors are thrown (if my data were not on the GPU, the type of the tensor wouldn’t match the net, so an error would be thrown). So it seems that net.to(device) mutates my net and must put it on the GPU; otherwise I would have seen an error.

I wonder if this is an unintended bug in PyTorch, and whether they meant net = net.to(device) to be the only form that works.

nairbv · February 10, 2020, 10:59pm

Ah, you’re correct. I’m used to Tensor.to(), which is out-of-place. It might be a better habit to reassign anyway, as I have seen a number of mistakes where tensor.to(device) was assumed to be in-place.

In [14]: net=torch.nn.Linear(3,4)
In [15]: net.weight.device
Out[15]: device(type='cpu')
In [16]: net.to('cuda')
Out[16]: Linear(in_features=3, out_features=4, bias=True)
In [17]: net.weight.device
Out[17]: device(type='cuda', index=0)

pinocchio · February 11, 2020, 8:36pm

what does the in-placeness of .to(device) have to do with this?

I’m confused.

ptrblck · February 11, 2020, 9:35pm

If you call .to() on a tensor, the operation will not be performed in-place, so you need to assign the result to a variable.
On a module, however, all parameters are moved internally to the specified device, so you don’t need to assign the result of the call back to the model.

However, as @nairbv mentioned, it might be a good habit just to use the assignment by default to avoid possible errors.
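
To make the distinction concrete without needing a GPU, here is a small sketch that uses a dtype conversion instead of a device move. It assumes the default dtype is still float32; as far as I know, the same in-place versus out-of-place behavior applies to device moves.

```python
import torch
import torch.nn as nn

net = nn.Linear(3, 4)
x = torch.randn(2, 3)

# Module.to() converts the parameters in place and returns the module itself.
returned = net.to(torch.float64)
print(returned is net)        # True
print(net.weight.dtype)       # torch.float64

# Tensor.to() returns a new tensor; the original is left untouched.
x.to(torch.float64)           # result discarded, so nothing changes
print(x.dtype)                # torch.float32

x = x.to(torch.float64)       # reassign to keep the converted tensor
print(x.dtype)                # torch.float64

# When no conversion is needed, Tensor.to() does not copy the data.
print(x.to(torch.float64).data_ptr() == x.data_ptr())   # True
```

Because Module.to() returns the module itself, `net = net.to(device)` and a bare `net.to(device)` end up in the same state, so always reassigning costs nothing and keeps the habit consistent with tensors.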

pinocchio · February 11, 2020, 9:36pm

This is what I’m trying to understand. I always expected .to(device) to work the way it does. I don’t understand why I need to re-assign. I always expected .to(device) to mutate things.

ptrblck · February 11, 2020, 9:48pm

It does not work in-place on tensors, so you have to reassign the result:

import torch

x = torch.randn(1, device='cpu')
print(x.device)  # cpu

x.to('cuda')     # result is discarded
print(x.device)  # cpu

x = x.to('cuda')
print(x.device)  # cuda:0

Ved · September 20, 2021, 8:43am

@ptrblck You have been so patient in explaining this

Ardeal · November 1, 2021, 1:12am

@ptrblck ,
Do you mean that tensor.to() is not in-place and tensor.cuda() is in-place?
Is that the major difference between to() and cuda()?

ptrblck · November 1, 2021, 8:08am

No, neither is in-place on a tensor.
Calling to() or cuda() on an nn.Module object will internally move all parameters and buffers to the device, dtype, or memory format so you wouldn’t need to reassign a model:

model.to('cuda') # works
tensor.to('cuda') # tensor is still on the original device afterwards, as the CUDATensor wasn't assigned
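
A small sketch of what "parameters and buffers" covers. The `Scaler` module and its attribute names are invented for this illustration, and a dtype conversion stands in for a device move so the example also runs without a GPU. Note that a tensor stored as a plain attribute is neither a parameter nor a buffer, so Module.to() leaves it alone.

```python
import torch
import torch.nn as nn


class Scaler(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))        # parameter
        self.register_buffer("offset", torch.zeros(3))   # buffer
        self.plain = torch.zeros(3)                      # plain attribute

    def forward(self, x):
        return x * self.weight + self.offset


model = Scaler()
model.to(torch.float64)    # no reassignment needed for a module

print(model.weight.dtype)  # torch.float64
print(model.offset.dtype)  # torch.float64
print(model.plain.dtype)   # torch.float32, not touched by Module.to()
```

If a tensor should follow the model across devices, registering it with register_buffer() is one way to make that happen.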

Ardeal · November 1, 2021, 8:30am

@ptrblck ,
Thank you!

So there is no difference between to() and cuda().
The difference is in how to() and cuda() behave on a Module versus a tensor:
on a Module (i.e. a network), the Module itself is moved to the destination device;
on a tensor, the original stays on its original device, and the returned tensor is on the destination device.

right?

ptrblck · November 1, 2021, 8:32am

Yes, almost. The first point is right if you are only concerned about the device, but to() can also change the dtype, memory format, etc., so it has more functionality than cuda()/cpu().
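
For illustration, a short sketch of the extra things to() can do beyond a device move. The tensor shape is arbitrary, and the device falls back to the CPU when CUDA is unavailable.

```python
import torch

x = torch.randn(2, 3, 4, 5)   # float32, contiguous NCHW layout

# Change only the dtype.
half = x.to(torch.float16)
print(half.dtype)             # torch.float16

# Change device and dtype in one call.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
moved = x.to(device, dtype=torch.float64)
print(moved.device.type, moved.dtype)

# Change the memory format; the shape stays the same, the strides change.
nhwc = x.to(memory_format=torch.channels_last)
print(nhwc.shape)                                              # torch.Size([2, 3, 4, 5])
print(nhwc.is_contiguous(memory_format=torch.channels_last))   # True
```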

Ardeal · November 1, 2021, 8:35am

@ptrblck ,
Thank you!

If we only consider the device, there is no difference between to() and cuda().
If we consider other functionality, to() can do much more than cuda().

Ardeal · December 16, 2021, 2:02am

@ptrblck ,

One more question:

# code 1:
device=torch.device('cuda')
net.to(device)

# code 2:
device=torch.device('cuda:0') # 'cuda:1'
net.to(device)

What is the difference between 'cuda' and 'cuda:0' or 'cuda:1'? In which cases should I use each?

ptrblck · December 16, 2021, 7:13am

torch.device('cuda') (or just the 'cuda' string) will use the default (current) CUDA device, while torch.device('cuda:1') (or the 'cuda:1' string) will explicitly use GPU 1.
The CUDA semantics docs explain this behavior with some examples:

cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)
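
The docs example above needs at least three GPUs to run. As a lighter sketch, the snippet below only inspects torch.device objects, which can be created without any GPU present; the calls that do need a working CUDA setup are guarded.

```python
import torch

default = torch.device("cuda")      # no index: resolves to the current CUDA device
explicit = torch.device("cuda:1")   # always GPU 1

print(default.type, default.index)    # cuda None
print(explicit.type, explicit.index)  # cuda 1

# Two equivalent spellings of the same explicit device.
print(torch.device("cuda", 1) == explicit)   # True

# The calls below need a working CUDA setup, so they are guarded.
if torch.cuda.is_available():
    print(torch.cuda.device_count())     # number of visible GPUs
    print(torch.cuda.current_device())   # index that plain "cuda" resolves to
    t = torch.zeros(1, device=default)
    print(t.device)                      # cuda:0 unless the current device was changed
```

On a single-GPU machine 'cuda' and 'cuda:0' end up in the same place; the explicit index only starts to matter once several GPUs are visible.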