Multiple-GPU Error - Data Parallel

Hi there, I’m trying to run my code across multiple GPUs and am getting the following error:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

I’ve seen a few posts around here and on https://github.com/pytorch/pytorch/, but nothing seems to help in my case. I’m using a pre-trained model from https://github.com/osmr/imgclsmob, and have modified the forward function to return the activations as well as the output. Here’s a simplified version of my code:

from pytorchcv.model_provider import get_model as ptcv_get_model
import torch
import torch.nn as nn
import types

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = ptcv_get_model("densenet40_k12_cifar10", root = 'loc', pretrained=True)

def my_forward(self, x):
    activations = []
    for module in self.features._modules.values():
        x = module(x)  # error happens here
        activations.append(x)
    x = x.view(x.size(0), -1)
    x = self.output(x)
    return x, activations

net.forward = types.MethodType(my_forward, net)

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=[0, 1, 2, 3])
net.to(device)
net.eval()

And my full error message is:

Traceback (most recent call last):
  File "main.py", line 471, in <module>
    train_student(student, teach)
  File "main.py", line 155, in train_student
    outputs_teacher, ints_teacher = teach(inputs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 364, in my_forward
    x = module_val(x)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/s1874193/miniconda3/envs/distill/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

Any suggestions would be really appreciated. Thanks!

Do you create any tensors, parameters or modules on-the-fly in your forward method?
Could you post a code snippet to reproduce this error, so that we could have a look?
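
For reference, this is the kind of pattern I mean — creating a tensor on a hard-coded device inside forward breaks under DataParallel, since each replica’s forward runs on its own GPU (a minimal illustration, not your code):

import torch
import torch.nn as nn

class BadMask(nn.Module):
    def forward(self, x):
        # Pinned to cuda:0 -- replicas on cuda:1, cuda:2, ... receive x on
        # their own device, so x * mask raises a device-mismatch error.
        mask = torch.ones(x.size(0), 1, 1, 1, device='cuda:0')
        return x * mask

class GoodMask(nn.Module):
    def forward(self, x):
        # Device-agnostic: derive the device from the input instead.
        mask = torch.ones(x.size(0), 1, 1, 1, device=x.device)
        return x * mask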

Thanks @ptrblck. This reproduces the error for me:

import os
from pytorchcv.model_provider import get_model as ptcv_get_model
import torch
import types
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

cifar_loc = '/disk/scratch/s1874193/datasets/cifar'

net = ptcv_get_model("densenet40_k12_cifar10", root = '/home/s1874193/Distillation/xdistill/pre_trained_models', pretrained=True)
def my_forward(self, x):
    activations = []
    for module in self.features._modules.values():
        x = module(x)
        activations.append(x)
    x = x.view(x.size(0), -1)
    x = self.output(x)
    return x, activations

net.forward = types.MethodType(my_forward, net)

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=[0,1,2,3])
net.to(device)
net.eval()

x = torch.randn(4, 3, 32, 32)
out, act = net(x)


I’m not doing anything with the forward() method other than what you can see here. I think it’s somehow related to how I’m using CIFAR, as I didn’t get the error when just doing

x = torch.randn(1, 3, 32, 32)
out, activations = net(x)

Thanks!

I’m not sure about that conclusion. With a single sample, DataParallel only has one input chunk to distribute, so everything stays on the first device and no mismatch can occur. Try passing more than a single sample and you should see the same error:

x = torch.randn(4, 3, 32, 32)
out, act = net(x)

I’ll try to dig into it a bit later.
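
The reason batch size matters here: DataParallel scatters the input along dim 0, so a batch of one never leaves the first GPU and only one replica runs. You can see the split directly with the functional scatter utility (a quick check, assuming all four GPUs are visible):

import torch
from torch.nn.parallel import scatter

x1 = torch.randn(1, 3, 32, 32)
x4 = torch.randn(4, 3, 32, 32)

# scatter splits a batch along dim 0 across the target GPUs
print(len(scatter(x1, [0, 1, 2, 3])))  # 1 chunk -> only the cuda:0 replica runs
print(len(scatter(x4, [0, 1, 2, 3])))  # 4 chunks -> one per GPU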

Yes, you’re right about that. Thank you! I’ll keep trying to get somewhere myself. I’ve edited the OP to clean things up using

x = torch.randn(4, 3, 32, 32)
out, act = net(x)
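
A follow-up suspicion, in case it helps: net.forward = types.MethodType(my_forward, net) binds the new forward to the original net instance. nn.DataParallel replicates the module onto each GPU, but each replica still carries that bound method, so every forward pass runs against the original model’s parameters on cuda:0. That would explain both symptoms: input chunks scattered to devices 1-3 hit weights on device 0, while a single sample (which stays on device 0) works fine. Here’s a sketch of a wrapper module that avoids the monkey-patching entirely (the DenseNetWithActivations name is just for illustration):

import torch
import torch.nn as nn
from pytorchcv.model_provider import get_model as ptcv_get_model

class DenseNetWithActivations(nn.Module):
    """Same logic as my_forward, but defined on a proper module class
    so nn.DataParallel can replicate it cleanly onto each device."""
    def __init__(self, net):
        super().__init__()
        self.features = net.features
        self.output = net.output

    def forward(self, x):
        activations = []
        for module in self.features._modules.values():
            x = module(x)
            activations.append(x)
        x = x.view(x.size(0), -1)
        x = self.output(x)
        return x, activations

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = DenseNetWithActivations(
    ptcv_get_model("densenet40_k12_cifar10", pretrained=True))
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=[0, 1, 2, 3])
net.to(device)
net.eval()

out, activations = net(torch.randn(4, 3, 32, 32))

Since the custom forward now lives on the class rather than on one instance, replicate() copies it to every device along with the parameters.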