How can I use torch.nn.DataParallel while I'm doing transfer learning?

I’m trying to run my training on multiple GPUs using the following code (latest PyTorch version):

from torchvision import models
import torch

model = models.vgg16(pretrained=True)
model.classifier._modules['6'] = torch.nn.Linear(4096, 10)  # replace the last classifier layer with a 10-class head
self.model = torch.nn.DataParallel(model, device_ids=[0, 1, 2]).cuda()
self.model = model.to(f'cuda:0')
...
def forward(self, input_data):
    output = self.model.forward(input_data)

I get this error when I call self.model.forward(input_data):

  File "/home/poahmadvand/py3env/lib/python3.7/site-packages/torchvision/models/vgg.py", line 43, in forward
    x = self.features(x)
  File "/home/poahmadvand/py3env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/poahmadvand/py3env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/poahmadvand/py3env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/poahmadvand/py3env/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/poahmadvand/py3env/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

How can I fix this error? Thanks.

I can’t seem to reproduce the issue that you’re seeing. This code works fine on PyTorch 1.6:

from torchvision import models
import torch
model = models.vgg16(pretrained=True)
model.classifier._modules['6'] = torch.nn.Linear(4096, 10)
model = torch.nn.DataParallel(model, device_ids=[0,1,2]).cuda()
model = model.to(f'cuda:0')
input_data = torch.rand(10, 3, 225, 225)
model(input_data)

Am I missing something here?

Thanks. Do you have multiple GPUs on your machine?

Yes, I’m trying this on an 8 GPU machine.

@pouya.ahmadvand Do you run into the same error even if you run the code I pasted above?

Thanks, I’ll try it tomorrow and let you know.

@pritamdamania87 Thanks, it works now. The problem was that I used a function to fetch the indices of the GPUs with the most free memory (a sketch of what such a selector might look like is at the end of this post), and this function returns an array like [5, 6, 7]. Then:

selected_gpus = [5, 6, 7]
model = torch.nn.DataParallel(model, device_ids=selected_gpus).cuda()   
model = model.to(f'cuda:{selected_gpus[0]}')

which gives me that error. Now I take the array returned by the GPU selector, use it to set the CUDA_VISIBLE_DEVICES environment variable, and then do:

import os

selected_gpus = [5, 6, 7]
os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(x) for x in selected_gpus)
# after remapping, the visible GPUs are renumbered 0 .. len(selected_gpus) - 1
model = torch.nn.DataParallel(model, device_ids=list(range(len(selected_gpus)))).cuda()
model = model.to(f'cuda:0')
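
Putting it together, a self-contained version of this approach looks roughly like this (a sketch; the 224×224 input and the 10-class head just follow the earlier snippets, and the key point is that CUDA_VISIBLE_DEVICES is set before torch initializes CUDA):

import os

selected_gpus = [5, 6, 7]                  # indices returned by the GPU selector
# set this before importing torch so CUDA is initialized with the remapped devices
os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(x) for x in selected_gpus)

import torch
from torchvision import models

model = models.vgg16(pretrained=True)
model.classifier._modules['6'] = torch.nn.Linear(4096, 10)   # 10-class head for transfer learning

# the visible GPUs are now numbered 0 .. len(selected_gpus) - 1
model = torch.nn.DataParallel(model, device_ids=list(range(len(selected_gpus)))).cuda()
model = model.to('cuda:0')                 # device_ids[0] after remapping

input_data = torch.rand(10, 3, 224, 224).to('cuda:0')
output = model(input_data)                 # batch is scattered across the three selected GPUs
print(output.shape)                        # torch.Size([10, 10])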

It works fine now.
Thanks!
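
For reference, here is a hypothetical sketch of the kind of GPU-selector helper mentioned above (not the actual function used in this thread). It queries nvidia-smi for the free memory on each GPU and returns the indices of the k GPUs with the most, without touching torch.cuda, so that CUDA_VISIBLE_DEVICES can still take effect afterwards:

import subprocess

def select_gpus(k=3):
    # hypothetical helper: ask nvidia-smi for the free memory (MiB) on each GPU
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,noheader,nounits'])
    free_mib = [int(line) for line in out.decode().strip().splitlines()]
    # indices of the k GPUs with the most free memory, e.g. [5, 6, 7]
    return sorted(range(len(free_mib)), key=lambda i: free_mib[i], reverse=True)[:k]

selected_gpus = select_gpus(k=3)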