DataParallel and Conv2D

Hello,

I am trying to train a VGG model using DataParallel on a new computer with multiple GPUs. The code previously worked properly when training on my other computer with a single GPU. I have changed the code to include:

class MyDataParallel(nn.DataParallel):

    def __getattr__(self, name):
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.module, name)
...
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        net = MyDataParallel(net, device_ids=range(torch.cuda.device_count()))
    net.to(device)

where the subclass at the top is created only so that my custom forward method can still access the wrapped model's attributes (I am training the model on CIFAR-100 instead of ImageNet, so I replace the fully connected layers with a single FC layer, and I also don't need the average pool). With these changes, I get this error:

return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: Expected tensor for argument #1 ‘input’ to have the same device as tensor for argument #2 ‘weight’; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

When looking at the library files for torch.nn.modules.conv, I noticed that the forward function for Conv2d is different from Conv1d and Conv3d (the snippets below are Conv1d.forward, Conv2d._conv_forward, and Conv3d.forward, in that order):

    def forward(self, input: Tensor) -> Tensor:
        if self.padding_mode != 'zeros':
            return F.conv1d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
                            self.weight, self.bias, self.stride,
                            _single(0), self.dilation, self.groups)
        return F.conv1d(input, self.weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
...

    def _conv_forward(self, input, weight):
        if self.padding_mode != 'zeros':
            return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
                            weight, self.bias, self.stride,
                            _pair(0), self.dilation, self.groups)
        return F.conv2d(input, weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
...

    def forward(self, input: Tensor) -> Tensor:
        if self.padding_mode != 'zeros':
            return F.conv3d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
                            self.weight, self.bias, self.stride, _triple(0),
                            self.dilation, self.groups)
        return F.conv3d(input, self.weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

Is there a particular reason why Conv2d is the only one that uses weight instead of self.weight in its forward function? I feel like that might be why DataParallel errors at runtime. I am using Spyder as an IDE, and I have already put breakpoints in dataparallel.py. I can confirm that when the code hits

replicas = self.replicate(self.module, self.device_ids[:len(inputs)])

the individual replicas are on their correct devices, and that each replica's conv layers have their weight tensors on the proper device, but for some reason the forward method keeps trying to fetch the weight tensor from device 0.
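
For context on what I was stepping through, DataParallel's forward pass roughly does the following. This is my simplified paraphrase of torch/nn/parallel/data_parallel.py, not the exact source:

    def forward(self, *inputs, **kwargs):
        # 1. split the input batch across the visible devices
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])
        # 2. copy the module (and all of its parameters) onto each device
        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
        # 3. run each replica on its own chunk of the batch, on its own device
        outputs = self.parallel_apply(replicas, inputs, kwargs)
        # 4. collect the per-device outputs back on the output device
        return self.gather(outputs, self.output_device)

So after replicate() every copy should be using its own weights, which is why it was confusing that the convolution kept reaching for device 0.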

The conv2d library code was not the problem. I found out the problem was the one listed here: since I was running VGG on CIFAR-100, I had to rewrite the forward method of PyTorch's default VGG network, because it is built for ImageNet and includes an average-pool layer that errors with CIFAR-100's input size. Replacing a network's methods with types.MethodType is incompatible with DataParallel. My solution was to create my own "MyVGG" class that takes a VGG model as input and takes all of its parameters, so that I could write my own forward function inside that class.
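
A minimal sketch of that workaround (the class name, the vgg16 variant, and the 512-feature size for CIFAR-100's 32x32 inputs are my own assumptions here, not the exact code I used):

import torch
import torch.nn as nn
from torchvision.models import vgg16

class MyVGG(nn.Module):
    """Wrap a torchvision VGG and define forward as a regular method,
    instead of rebinding forward with types.MethodType."""

    def __init__(self, vgg: nn.Module, num_classes: int = 100):
        super().__init__()
        self.features = vgg.features                   # reuse the conv stack from the original model
        self.classifier = nn.Linear(512, num_classes)  # single FC layer; 512 channels remain after the conv stack on 32x32 inputs

    def forward(self, x):
        x = self.features(x)       # no average pool: CIFAR-100 images are already 1x1 spatially at this point
        x = torch.flatten(x, 1)
        return self.classifier(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = MyVGG(vgg16(), num_classes=100)
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)     # forward is now a normal method, so plain DataParallel should replicate it correctly
net.to(device)

Because forward is defined on the module itself, replicate() copies it along with the parameters, and the device-mismatch error goes away.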