DataParallel with custom model

Greetings,

I don't quite get how to use the DataParallel wrapper to run my custom model on multiple GPUs.

In my particular case, I wrote my model with an evaluate member function that already uses the device. In my torch framework, all my train routines expect the models to have this evaluate function.

My model is a custom ResNet model, built on ResNet18 from torchvision.

import torch.nn as nn
import torchvision
import torch

class ResNet(nn.Module):
    def __init__(self, model, device):
        super(ResNet, self).__init__()
        self.resnet = model
        self.device = device
        if not isinstance(model, torchvision.models.ResNet):
            raise ValueError("The given model is not an instance of resnet.")

    def forward(self, features):
        return self.resnet(features)
    
    def evaluate(self, data_loader):
        self.eval()
        loss = 0
        correct = 0
        criterion = nn.CrossEntropyLoss(reduction="sum")
        with torch.no_grad():
            for data, target in data_loader:
                data, target = data.to(self.device), target.to(self.device)
                output = self(data)
                loss += criterion(output, target).item()                        # sum up batch loss
                correct += (target == torch.argmax(output, dim=1)).cpu().sum()  # does cpu make sense?

        accu = 100. * correct / len(data_loader.dataset)

        return (loss, accu)

model = torchvision.models.resnet18()  ## get predefined resnet.

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  

res_model = ResNet(model, device)  ## instantiate custom model

res_model_wrapped = nn.DataParallel(res_model, device_ids=[0, 3])  ## wrap in DataParallel for GPUs 0 and 3
res_model_wrapped.to(device)


### call train routine on `res_model_wrapped.module`

The code runs like this, but only one of the GPUs is doing any work, the one with index 0.
Any idea how I can make this work?

When you use it like this:

torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

you are specifying that you will only use the GPU with index 0.
If you want to use all the available GPUs:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

But if you want to use specific GPUs:

device = torch.device("cuda:1,3" if torch.cuda.is_available() else "cpu")

Using
device = torch.device("cuda:0,3" if torch.cuda.is_available() else "cpu")
gives me
RuntimeError: Invalid device string: 'cuda:0,3'

Calling the train routine on res_model_wrapped.module is wrong, as it will skip the nn.DataParallel wrapper and call the internal module directly. nn.DataParallel will chunk the input data and use the forward method of your model for each input chunk on the corresponding device.
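
As a minimal sketch of that intended usage (sticking with GPUs 0 and 3 from your snippet; the batch shape is just illustrative):

import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18()
device = torch.device("cuda:0")                       # primary device, must match device_ids[0]

dp_model = nn.DataParallel(model, device_ids=[0, 3])
dp_model.to(device)                                   # moves the wrapped module to cuda:0

batch = torch.randn(64, 3, 224, 224, device=device)   # illustrative input batch
output = dp_model(batch)   # DataParallel splits the batch, runs forward() on GPUs 0 and 3,
                           # and gathers the outputs back on cuda:0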


If I call the train routine simply on res_model_wrapped,
I get the error
AttributeError: 'DataParallel' object has no attribute 'evaluate', because the train routine calls the evaluate function.

Should I call res_model_wrapped.module.evaluate() at these points? Will it be a problem that the evaluate function uses the device?

You would need to call custom methods from the forward function without hard-coding any device, since forward is what nn.DataParallel uses for the actual data-parallel execution. Using the internal .module will skip nn.DataParallel and you will be back to the single-device module again.
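
For example, one way to restructure this is a standalone evaluation routine that only ever goes through the wrapper's forward; this is a sketch assuming the train routine can be changed to call such a function instead of a member function:

import torch
import torch.nn as nn

def evaluate(dp_model, data_loader, device):
    # evaluation loop that only calls the wrapped model's forward,
    # so nn.DataParallel can still split each batch across its device_ids
    dp_model.eval()
    criterion = nn.CrossEntropyLoss(reduction="sum")
    loss = 0.0
    correct = 0
    with torch.no_grad():
        for data, target in data_loader:
            # move the batch to the primary device; DataParallel scatters it from there
            data, target = data.to(device), target.to(device)
            output = dp_model(data)  # goes through DataParallel, not .module
            loss += criterion(output, target).item()
            correct += (target == torch.argmax(output, dim=1)).sum().item()
    accu = 100. * correct / len(data_loader.dataset)
    return loss, accu

The train routine would then work with the DataParallel object itself plus the single primary device (cuda:0 here), instead of reaching into .module.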

So my train routine should not call any member functions of the module wrapped in nn.DataParallel that explicitly use the device?