What to send to GPU?

Hello there,

I am a bit confused about what to send to the GPU.

Take a look at these routines:

import numpy as np
import torch
import torch.nn.functional as F

# device is assumed to be defined elsewhere, e.g. device = torch.device('cuda')

def train(model, train_loader, datasize, lr, weight_decay, alpha, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        model.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target) * datasize
        loss.backward()
        update_params(model, lr, weight_decay, alpha)

def update_params(model, lr, weight_decay, alpha):
    for p in model.parameters():
        if not hasattr(p, 'buf'):
            p.buf = torch.randn(p.size()).to(device) * np.sqrt(lr)
        d_p = p.grad.data
        d_p.add_(p.data, alpha=weight_decay)      # add weight decay to the gradient
        eps = torch.randn(p.size()).to(device)    # Gaussian noise on the same device
        buf_new = (1 - alpha) * p.buf - lr * d_p + (2.0 * lr * alpha)**.5 * eps
        p.data.add_(buf_new)
        p.buf = buf_new

In the train-routine, are the “output” and “loss” variables automatically created on the GPU?

In the “update” routine, are the “lr” and “alpha” constants automatically sent to the GPU as well?

What if I have a code snippet like this where “self.model” is already on the GPU:

        loss_train = np.zeros(self.epochs+1)                 # store loss
        accu_train = np.zeros(self.epochs+1)                 # store accuracies
        loss_test = np.zeros(self.epochs+1)
        accu_test = np.zeros(self.epochs+1)
        
        (loss_train[0], accu_train[0]) = self.model.evaluate(self.train_loader)
        (loss_test[0], accu_test[0]) = self.model.evaluate(self.test_loader) 

Will the loss and accu arrays be on the GPU as well? Or will the results of the right-hand side be sent from the GPU to the CPU first?

Cheers!

Yes, if the model parameters and the input used to calculate the output were pushed to the GPU.
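As a quick sanity check (a minimal sketch, assuming a CUDA device is available), you can print the .device attribute of output and loss; both follow the device of the model parameters and the input:

import torch
import torch.nn.functional as F

device = torch.device('cuda')                      # assumes a CUDA device is available
model = torch.nn.Linear(10, 3).to(device)
data = torch.randn(4, 10, device=device)
target = torch.randint(0, 3, (4,), device=device)

output = model(data)                               # created on the GPU
loss = F.nll_loss(F.log_softmax(output, dim=1), target)
print(output.device, loss.device)                  # cuda:0 cuda:0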

Yes, scalar values are automatically pushed to the GPU if needed.
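For example (a small sketch), multiplying a CUDA tensor by a plain Python float such as lr keeps the result on the GPU; no explicit to(device) call is needed for the scalar:

import torch

grad = torch.randn(5, device='cuda')   # assumes a CUDA device is available
lr = 0.01                              # plain Python float
step = lr * grad                       # result stays on the GPU
print(step.device)                     # cuda:0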

It depends on the implementation of the evaluate method, which isn’t a built-in PyTorch method of a module.
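If in doubt, you can inspect what your evaluate method returns. A small sketch (report_device, loss_val, and accu_val are hypothetical names; the usage lines assume your own model and loader from above):

import torch

def report_device(name, value):
    # Works for both tensors and plain Python numbers returned by an evaluate() method
    if torch.is_tensor(value):
        print(name, 'is a tensor on', value.device)
    else:
        print(name, 'is a plain Python', type(value).__name__)

# usage, assuming your own model and loader:
# loss_val, accu_val = model.evaluate(train_loader)
# report_device('loss', loss_val)
# report_device('accu', accu_val)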


Thank you ptrblck!

The model with the evaluate function is given by:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28*28, 500)
        self.fc2 = nn.Linear(500, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=1)
    
    def evaluate(self, data_loader):
        self.eval()
        loss = 0
        correct = 0
        criterion = nn.NLLLoss(reduction='sum')
        with torch.no_grad():
            for data, target in data_loader:
                data, target = data.to(device), target.to(device)
                output = self.forward(data)
                loss += criterion(output, target).data.item()             # sum up batch loss
                pred = output.data.max(1, keepdim=True)[1]                # get the index of the max log-probability
                correct += pred.eq(target.data.view_as(pred)).cpu().sum() # does cpu make sense?
    
        loss /= len(data_loader.dataset)
        accu = 100. * correct / len(data_loader.dataset)
        
        return (loss, accu)

So if I call it via

(loss_train[0], accu_train[0]) = self.model.evaluate(self.train_loader)

will the arrays on the left be on the GPU? Do questions like these even matter?

No, since loss is a Python scalar value created via item() while accu was created from correct which was explicitly pushed to the CPU:

correct += pred.eq(target.data.view_as(pred)).cpu().sum() # does cpu make sense?
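In other words (a minimal sketch of the two cases in isolation, assuming a CUDA device is available):

import torch
import torch.nn.functional as F

output = F.log_softmax(torch.randn(8, 10, device='cuda'), dim=1)
target = torch.randint(0, 10, (8,), device='cuda')

loss = F.nll_loss(output, target, reduction='sum').item()   # Python float, stored on the CPU
pred = output.max(1, keepdim=True)[1]                       # still a CUDA tensor
correct = pred.eq(target.view_as(pred)).cpu().sum()         # explicitly moved: CPU tensor

print(type(loss))        # <class 'float'>
print(pred.device)       # cuda:0
print(correct.device)    # cpu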

Sorry, I am still unclear… loss and correct were created via

loss = 0
correct = 0

so they were not explicitly pushed to the GPU, but they are created inside the member function evaluate of an object that was pushed to the GPU. From that reasoning alone, it would not be clear to me what happens… You seem to say that they are not created on the GPU. OK, then moving on.
In the following snippet, we have

loss += criterion(output, target).data.item()            
pred = output.data.max(1, keepdim=True)[1]              
correct += pred.eq(target.data.view_as(pred)).cpu().sum()

The item() function has no influence on the device, I assume. The cpu() function of course does. I assume that without the cpu(), the expression

pred.eq(target.data.view_as(pred)).cpu().sum()

would be created on the GPU as target and pred are on the GPU, no?

Is there a detailed documentation somewhere that explains when stuff is automatically pushed to the GPU? I feel like not understanding that in detail is the cause of many bottlenecks in practice…

No, that’s not what I explained. The intermediate tensors were created on the GPU (assuming device is pointing to a GPU), but you are explicitly moving the results back to the CPU via the item() and cpu() calls. Even if the tensors were on the GPU previously, you are moving them to the CPU, so loss and accu are both on the CPU.

That’s wrong, since item() creates a Python scalar which has to be stored on the CPU.

Yes, that’s correct. The cpu() call moves the result tensor to the CPU.
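To make the distinction concrete (a small sketch, assuming a CUDA device is available):

import torch

t = torch.ones(3, device='cuda')
eq = t.eq(torch.ones(3, device='cuda'))   # intermediate result lives on the GPU
print(eq.device)                          # cuda:0

print(eq.cpu().sum().device)              # cpu             -> cpu() moved the tensor
print(type(t.sum().item()))               # <class 'float'> -> item() returns a Python scalar on the CPU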


And if I only have

loss = 0
correct = 0

in the evaluate function,
this would be normal Python scalars on the CPU, even if the object that calls evaluate is on the GPU?

Yes, both variables would start out as plain Python integers on the CPU, but depending on the type of the other operand in a later operation you would overwrite these variables and could end up with tensors, as seen here:

loss = 1
loss += torch.ones(1, device='cuda')
print(loss)
# tensor([2.], device='cuda:0')

Thank you ptrblck,

I feel more confident now.