Multiple-GPU Error - Data Parallel

Hi there, I’m trying to run my code across multiple GPUs and am getting the following error:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

I’ve seen a few posts around here and on https://github.com/pytorch/pytorch/, but nothing seems to be of use for me. I’m using a pre-trained model from https://github.com/osmr/imgclsmob, and have modified the forward function to return the activations as well as the output. Here’s a simplified version of my code:

from pytorchcv.model_provider import get_model as ptcv_get_model
import torch
import torch.nn as nn
import types

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = ptcv_get_model("densenet40_k12_cifar10", root = 'loc', pretrained=True)
def my_forward(self, x):
    activations = []
    for module in self.features._modules.values():
        x = module(x) #error happens here
        activations.append(x)
    x = x.view(x.size(0), -1)
    x = self.output(x)
    return x, activations

net.forward = types.MethodType(my_forward, net)

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=[0,1,2,3])
net.to(device)
net.eval()

And my full error message is:

Traceback (most recent call last):
  File "main.py", line 471, in <module>
    train_student(student, teach)
  File "main.py", line 155, in train_student
    outputs_teacher, ints_teacher = teach(inputs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 364, in my_forward
    x = module_val(x)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

Any suggestions would be really appreciated. Thanks!

Do you create any tensors, parameters or modules on-the-fly in your forward method?
Could you post a code snippet to reproduce this error, so that we could have a look?

Thanks @ptrblck. This reproduces the error for me:

import os
from pytorchcv.model_provider import get_model as ptcv_get_model
import torch
import types
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

cifar_loc = '/disk/scratch/s1874193/datasets/cifar'

net = ptcv_get_model("densenet40_k12_cifar10", root = '/home/s1874193/Distillation/xdistill/pre_trained_models', pretrained=True)
def my_forward(self, x):
    activations = []
    for module in self.features._modules.values():
        x = module(x)
        activations.append(x)
    x = x.view(x.size(0), -1)
    x = self.output(x)
    return x, activations

net.forward = types.MethodType(my_forward, net)

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=[0,1,2,3])
net.to(device)
net.eval()

x = torch.randn(4, 3, 32, 32)
out, act = net(x)


I’m not doing anything with the forward() method other than what you can see here. I think it’s somehow related to how I’m using CIFAR, as I didn’t get the error when just doing

x = torch.randn(1, 3, 32, 32)
out, activations = net(x)

Thanks!

I’m not sure about the conclusion.
Try to pass more than a single sample and you should see the same error (with a single sample there is nothing for DataParallel to split across the other GPUs, so everything runs on device 0):

x = torch.randn(4, 3, 32, 32)
out, act = net(x)

I’ll try to dig into it a bit later.

Yes, you’re right about that. Thank you! I’ll keep trying to get somewhere myself. I’ve edited the OP to clean things up using

x = torch.randn(4, 3, 32, 32)
out, act = net(x)

UPDATE: I managed to fix this by wrapping my model in a new class that contains my edited forward function, instead of using types.MethodType. It’s also important not to call .to(device) on anything inside the forward pass, as that moves tensors back to a fixed device after DataParallel has already scattered them.

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device(torch.cuda.current_device() if torch.cuda.is_available() else "cpu")

net = ptcv_get_model("densenet40_k12_cifar10", pretrained=True)

class ReturnLayers(nn.Module):
    def __init__(self, model):
        super(ReturnLayers, self).__init__()
        self.model = model

    def forward(self, x):
        activations = []
        for module in self.model.features._modules.values():
            x = module(x)
            activations.append(x)
        x = x.view(x.size(0), -1)
        x = self.model.output(x)
        return x, activations

net = ReturnLayers(net).to(device)

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)

net.eval()
x = torch.randn(4, 3, 32, 32)
out, act = net(x)


I’m running into a similar problem.
So your conclusion is:

do not modify a network’s methods (e.g. forward) with types.MethodType when you are going to use nn.DataParallel.

Right?

So you just created a new class instead of modifying the existing one.

But I’d still like to find a way to parallelize a modified version of an existing model without creating a new model class by copying the architecture code.
Is there a good way to do this?
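
One option that avoids copying the architecture is to keep the pretrained model untouched and wrap it in a thin nn.Module whose forward takes the wrapped model as an argument, rather than binding a new method to the original instance. That binding is presumably why the MethodType approach fails: the bound forward keeps referring to the original module, whose parameters stay on GPU 0, while DataParallel runs the replicas on other devices. Here’s a minimal sketch of the idea; ForwardWrapper and densenet_forward are just illustrative names, not part of pytorchcv or PyTorch:

import torch
import torch.nn as nn
from pytorchcv.model_provider import get_model as ptcv_get_model

class ForwardWrapper(nn.Module):
    # Generic wrapper: holds the original model as a registered submodule
    # and delegates to a custom forward function that receives that submodule.
    def __init__(self, model, forward_fn):
        super(ForwardWrapper, self).__init__()
        self.model = model          # replicated onto each GPU by DataParallel
        self.forward_fn = forward_fn

    def forward(self, x):
        # self.model here is the replica's copy, so its parameters live on the
        # same device as x after DataParallel has scattered the batch
        return self.forward_fn(self.model, x)

def densenet_forward(model, x):
    # Same logic as the edited forward above, written as a plain function that
    # takes the model as an argument instead of capturing it via MethodType
    activations = []
    for module in model.features._modules.values():
        x = module(x)
        activations.append(x)
    x = x.view(x.size(0), -1)
    return model.output(x), activations

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = ptcv_get_model("densenet40_k12_cifar10", pretrained=True)
net = ForwardWrapper(net, densenet_forward).to(device)
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)
net.eval()

out, act = net(torch.randn(4, 3, 32, 32))

This is essentially the same fix as the ReturnLayers class above, just parameterised by the forward function, so the wrapper can be reused for different models without touching their code.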