Question on a toy example: when is .to(device) necessary?

The following code is runnable and bug-free. I am puzzled why self.relu doesn't need to be cast to the GPU device while self.net1 and self.net2 are cast to the GPU. In my understanding, the default device of self.relu should be the CPU if there is no explicit cast. If that is correct, the output of self.net1 is on the GPU, which mismatches the device of self.relu. Why is this bug-free?

import torch
import torch.nn as nn
import torch.optim as optim


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).to('cuda:0')
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to('cuda:0')

    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:0'))


model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to('cuda:0')
loss_fn(outputs, labels).backward()
optimizer.step()

nn.ReLU doesn’t contain any parameters, so nothing has to be moved to the device.
Internally the functional API will be used on the input (which is already on the GPU) via F.relu.

You can check for parameters in a module via: print(dict(module.named_parameters())).
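
For example, a quick check along those lines (a minimal sketch; the last part assumes a CUDA device is available):

import torch
import torch.nn as nn

# nn.ReLU registers no parameters, so there is nothing to move between devices.
print(dict(nn.ReLU().named_parameters()))                 # {}
# nn.Linear registers a weight and a bias, which do live on a specific device.
print(list(dict(nn.Linear(10, 10).named_parameters())))   # ['weight', 'bias']

# A parameter-free module just applies the functional op to its input,
# so the output stays on whatever device the input is on.
if torch.cuda.is_available():
    x = torch.randn(4, 10, device='cuda:0')
    print(nn.ReLU()(x).device)                             # cuda:0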

Also, while this code works fine (and I guess it's just used as an example for this topic), I would generally recommend removing the .to() operations from the model internals and moving the complete model to the device via model.to('cuda'), as sketched below.
If you want to use model sharding across different devices, your workflow would be correct.
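
For reference, a sketch of the single-device version (assuming a single GPU; the CPU fallback is just for illustration):

import torch
import torch.nn as nn
import torch.optim as optim


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # No device handling inside the model itself.
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = ToyModel().to(device)    # moves all registered parameters and buffers at once
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10).to(device))   # move the input once
labels = torch.randn(20, 5).to(device)
loss_fn(outputs, labels).backward()
optimizer.step()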
