Tensors are on multiple CUDA devices

Hi
In various posts, I have seen comments on how to utilize
torch.nn.DataParallel to use multiple GPUs. My code also follows the template provided in this link. However, during the convolutional operation, I face the following error.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Could you please help me out with this error?

Can you post a minimal reproducible example?

Thank you for your reply.

Actually, the code is quite complex right now, so a minimal example might not reflect my case. However, some important parts of my code are shown below.

Here, model is AlexNet pre-trained on ImageNet.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.nn.DataParallel(model)
model = model.to(device)

While passing an image through the layers, the feature maps produced by each filter are multiplied by a weight called layer_landas, as shown below.
x is the input image. The x tensor is also transferred to the GPU via the .to(device) call before passing through each module of AlexNet.

layer_landas = layer_landas.to(device)
out_fm = self.conv_layer(x)
out_fm = torch.mul(layer_landas, out_fm)

The error is:

File "/project/Code/cnn.py", line 61, in __call__
out_fm = torch.mul(layer_landas, out_fm)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

I guess you’ve defined device once globally and are now reusing it in the forward pass in:

layer_landas = layer_landas.to(device)

If so, this would cause the issue since nn.DataParallel creates copies of the model on all specified devices, while you are moving layer_landas to device explicitly.
Use layer_landas = layer_landas.to(x.device) and it might work as the input tensor’s device would be used instead.
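
For illustration, something along these lines (a minimal sketch with made-up layer names and shapes, not your actual module):

import torch
from torch import nn

class WeightedConv(nn.Module):
    # sketch of a conv layer whose output feature maps are scaled
    # by a per-filter weight tensor ("landas")
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv_layer = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # a plain tensor (not a registered parameter), so it is not
        # moved automatically by model.to(device)
        self.layer_landas = torch.ones(out_channels, 1, 1)

    def forward(self, x):
        out_fm = self.conv_layer(x)
        # move the landas to the device of the incoming tensor,
        # not to a globally defined device
        landas = self.layer_landas.to(x.device)
        return torch.mul(landas, out_fm)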

Thank you @ptrblck for replying.

I followed your suggestion, but the same error was thrown on another line, where I pass the feature map from the previous layer into the next conv layer.

out_fm = self.conv_layer(x)

The error is:

out_fm = self.conv_layer(x)
   File "/opt/conda/envs/pytorch1_10/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
     return forward_call(*input, **kwargs)
   File "/opt/conda/envs/pytorch1_10/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 446, in forward
     return self._conv_forward(input, self.weight, self.bias)
   File "/opt/conda/envs/pytorch1_10/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
     return F.conv2d(input, weight, bias, self.stride,
 RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper___conv_depthwise2d)

Some parts of my code:

main function:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.nn.DataParallel(model)
model = model.to(device)

__call__ function in the model class (which inherits from nn.Module): in this function, I pass the feature maps through each module one by one. The feature_extractor iterates over each module of the features part of the CNN. You can see its body in the next code snippet.

    def __call__(self, data):
        featuremaps = []
        for name, module in self.modified_cnn._modules.items():
            if name == 'features':
                featuremaps, x = self.feature_extractor(data, self.modified_cnn.features, self.landa)
            elif "avgpool" in name.lower():
                x = module(x)
                x = x.view(x.size(0), -1)
            else:
                x = module(x)

        return featuremaps, x

class FeatureExtractor:

Here, when I reach a custom_convlayer module, I pass the feature map x and layer_landas to the module. The next code snippet shows part of this module.

def __call__(self, x, model, graph_landas=None):
    self.gradients = []
    featuremaps = []
    for name, module in model._modules.items():
        if type(module) == custom_convlayer:
            x = module((x, graph_landas))
        elif type(module) == torch.nn.modules.activation.ReLU:
            xx = torch.clone(x)
            x = module(xx.to(device))
            del xx
        else:
            x = module(x.to(device))

    return featuremaps, x

class custom_convlayer:
Here, when I pass the feature map into the conv layer, that error is thrown.


    def __call__(self, data):

        x = data[0]
        layer_landas = data[1]

        out_fm = self.conv_layer(x)

I guess you are hitting the same issue again.
Don’t use the global device attribute, as nn.DataParallel will create copies on all passed GPUs.
E.g. if you are passing 2 GPU ids to nn.DataParallel each GPU will get a clone of the model and the input data will also be split in the batch dimension and pushed to the corresponding device.
The model on cuda:0 will then get the input tensor on cuda:0 and the clone on cuda:1 will get the input tensor on cuda:1.
If you are now creating new tensors inside the model with device='cuda:0' it will raise a device mismatch, so use the .device attribute of the input or any registered parameter.
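
Roughly, what happens under the hood (a sketch, assuming two visible GPUs and a toy model):

import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
model = nn.DataParallel(model, device_ids=[0, 1]).to('cuda:0')

x = torch.randn(8, 3, 32, 32, device='cuda:0')
out = model(x)
# the replica on cuda:0 processes x[:4] (kept on cuda:0), while the
# replica on cuda:1 processes x[4:] (moved to cuda:1); a tensor created
# inside forward with device='cuda:0' would therefore mismatch inside
# the cuda:1 replica, while one created with device=x.device would not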

Also, don’t use the __call__ method, but implement forward instead, since __call__ is used internally by nn.Module.
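
E.g. something like this for the custom layer (a sketch based on your posted snippets, with assumed channel arguments):

import torch
from torch import nn

class custom_convlayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv_layer = nn.Conv2d(in_channels, out_channels, kernel_size=3)

    # implement forward, not __call__; nn.Module.__call__ runs hooks
    # and dispatches to forward internally
    def forward(self, data):
        x, layer_landas = data
        out_fm = self.conv_layer(x)
        return torch.mul(layer_landas.to(x.device), out_fm)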

Ok. As I understood, I shouldn’t manually transfer the input to the device; since I use nn.DataParallel, this transfer is done internally. So I commented out the part of my code where I was transferring the input image to the device. However, the issue remained unchanged, and the previous error is thrown.

Part of the code where I pass the input image into the model:

for inputs, labels, paths in data_loaders['train']:
      optimizer.zero_grad()
      #inputs = inputs.to(device)
      _, modified_output = modified_model((inputs, settings.class_index))

@ptrblck
The problem was solved. I commented out every place where I manually transferred the input tensor to the device.

Thank you very much @ptrblck.

Dear @ptrblck

I have a follow-up issue regarding this.
In my code, I defined a parameter of type nn.ParameterDict. In CPU mode and in GPU mode with one GPU, when I print the keys of this dictionary, the output is as follows.

odict_keys(['0', '1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9'])

However, when I use multiple GPUs, this dictionary is empty (even though it shouldn’t be) and the program throws an error. The output is as follows.

odict_keys([])
odict_keys([])
odict_keys([])
File "/project_antwerp/PathExtraction/Code/models.py", line 53, in __call__
  class_landas = graph_landas[class_index]
   File "/opt/conda/envs/pytorch1_10/lib/python3.9/site-packages/torch/nn/modules/container.py", line 586, in __getitem__
     return self._parameters[key]
 KeyError: '18'

I was wondering if you could please help me with this matter.

Could you post a minimal, executable code snippet which would reproduce this issue, please?
Without seeing code I would guess that you might be creating the ParameterDict object too late, i.e. after the DataParallel wrapping.
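
I.e. the order matters; a minimal sketch (MyModel is a made-up stand-in for your model class):

import torch
from torch import nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # the ParameterDict must exist before the DataParallel wrapping,
        # so every replica receives a copy of its parameters
        self.landa = nn.ParameterDict({
            str(i): nn.Parameter(torch.ones(10, 1)) for i in range(5)
        })

    def forward(self, x):
        return x * self.landa['0'].sum()

model = nn.DataParallel(MyModel()).to('cuda:0')  # fine: dict created first
# by contrast, adding parameters after the wrapping is too late:
# model.module.extra = nn.ParameterDict(...)     # replicas won't see this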

@ptrblck I provided a minimal, runnable case of my code. I put two print functions in the FeatureExtractor class to display the content of the parameter called graph_landas. Also, I create the parameter before wrapping the model with nn.DataParallel.

Thank you in advance for your attention given to my issue.

from torch import nn
from torchvision import models
from torchvision import datasets
from torchvision import transforms
from PIL import Image
import torch
from torch import optim
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class FeatureExtractor():
    def __init__(self):
        self.gradients = []

    def save_gradient(self, grad):
        self.gradients.append(grad)

    def __call__(self, x, model, graph_landas=None):
        self.gradients = []
        featuremaps = []
        labels = x[1]
        x = x[0]
        print('printing landa parameter ', graph_landas)
        graph_landas = graph_landas.to(x.device)
        class_landas = None
        for class_index in [0,1,2,3,4]:
            class_index = str(class_index)
            if class_landas is None:
                class_landas = graph_landas[class_index]
            else:
                class_landas = torch.cat((class_landas, graph_landas[class_index]), 1)

        print('landa belonging to each class \n', class_landas)

        return featuremaps, x

class create_modifiedCNN(nn.Module):
    def __init__(self, model):
        super(create_modifiedCNN, self).__init__()
        self.modified_cnn = model
        self.feature_extractor = FeatureExtractor()
        num_classes = 5
        self.landa = nn.ParameterDict({
            str(class_index) : nn.Parameter(torch.ones((10,1),device=device),requires_grad=True)
            for class_index in range(num_classes)
        })

    def __call__(self, data):
        featuremaps = []
        for name, module in self.modified_cnn._modules.items():
            if name == 'features':
                featuremaps, x = self.feature_extractor(data, self.modified_cnn.features, self.landa)
            elif "avgpool" in name.lower():
                x = module(x)
                x = x.view(x.size(0), -1)
            else:
                x = module(x)
        return featuremaps, x



model = models.vgg16(pretrained=True)
out_feature = 5
in_feature = model.classifier[6].in_features
model.classifier[6] = nn.Linear(in_features=in_feature, out_features=out_feature, bias=True)
modified_model = torch.nn.DataParallel(create_modifiedCNN(model))
modified_model = modified_model.to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer_specs = [{'params': modified_model.module.landa[class_index]}
                   for class_index in modified_model.module.landa.keys()]
optimizer = optim.Adam(optimizer_specs, lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

#### Creating data transformation
image_size = 128
data_transform = {'train': transforms.Compose([
        transforms.Resize((image_size, image_size), Image.BILINEAR),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
        'val': transforms.Compose([
            transforms.Resize((image_size, image_size), Image.BILINEAR),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),

        'test': transforms.Compose([
            transforms.Resize((image_size, image_size), Image.BILINEAR),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
    }

### Creating dataloader
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=data_transform['train'])
train_loader = torch.utils.data.DataLoader(trainset, batch_size=1,shuffle=True, num_workers=1)
testset = datasets.CIFAR10(root='./data', train=False,download=True, transform=data_transform['test'])
test_loader = torch.utils.data.DataLoader(testset, batch_size=1,shuffle=False, num_workers=1)
dataloaders = {'train': train_loader, 'test': test_loader}

### Feeding the model with images
for inputs, labels in dataloaders['train']:
    optimizer.zero_grad()
    class_index = labels
    _, modified_output = modified_model((inputs,class_index))

I cannot reproduce the issue as your code runs into a shape mismatch:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x147 and 25088x4096)

After adding:

model.classifier[0] = nn.Linear(147, 4096)

the code runs fine and the parameters are printed.
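
(For context: the 147 presumably comes from the FeatureExtractor sketch returning the raw 3-channel input unchanged, so VGG16’s AdaptiveAvgPool2d((7, 7)) yields 3 * 7 * 7 = 147 features instead of the 512 * 7 * 7 = 25088 the classifier expects.)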

Thank you @ptrblck for trying the code.

Could you please tell me what your batch size is and how many GPUs you used? When I use one GPU the code is correct, but with more than one GPU I face this issue. Also, I didn’t understand where you faced the error you mentioned. Normally you should get the error in FeatureExtractor, because I didn’t implement the complete code for passing the tensor through the modules, so the issue happens before passing to any module.
I still get the error on the parameter. I sent you a screenshot of the output.
As you can see, when I print the parameter where it is created (i.e. in the create_modifiedCNN class), I can see its content. However, when I pass the image through the network, the parameter is empty.