How can I replace the convolution of a parallel model with a custom convolution?

I want to use custom convolutional layers without modifying the model definition file (such as resnet.py). Instead, I want to override the forward method of every convolutional layer in a given nn.Module.
I think the apply function should help here. My implementation is as follows:

def channel_pruning(m):

    def new_forward(self, x):
        y = F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
        k = int(m.out_channels * m.rate)
        if k > 0:
            s = F.adaptive_avg_pool2d(torch.abs(x), (1, 1)).view(x.size()[0], -1)
            g = F.relu(self.gate(s))
            i = (-g).topk(k, 1)[1]
            t = g.scatter(1, i, 0)
            t = t / torch.sum(t, dim=1).unsqueeze(1) * self.out_channels
            y = y * t.unsqueeze(2).unsqueeze(3)
        return y

    name = m.__class__.__name__
    if 'Conv' in name:
        m.rate = args.rate
        m.gate = nn.Linear(in_features=m.in_channels, out_features=m.out_channels, bias=True).to(device)
        nn.init.constant_(m.gate.bias, 1)
        nn.init.kaiming_normal_(m.gate.weight)
        m.forward = types.MethodType(new_forward, m)

model.apply(channel_pruning)

This code runs fine on the CPU and on a single GPU. Unfortunately, when I try to parallelize the model across several GPUs, the program fails with “RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)”.
The reason for the error seems clear: the model and the input are assigned to different GPUs. What should I do to make this code support multi-GPU parallelism?

I guess the to(device) call is causing the issue, since device seems to be a global variable.
Try using the .device attribute of a parameter or buffer that belongs to the module being patched, and create the newly initialized layer on that device.
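The suggestion above could look roughly like the sketch below. It is not the asker's final code: RATE is a hypothetical stand-in for args.rate, the isinstance check replaces the original class-name test, and the small nn.Sequential model exists only so the example runs on its own. The only change that matters is that the gate is created on m.weight.device rather than on a global device. The thread does not confirm that this change alone resolves the DataParallel error.

```python
import types

import torch
import torch.nn as nn
import torch.nn.functional as F

RATE = 0.5  # hypothetical stand-in for args.rate


def new_forward(self, x):
    # Regular convolution, followed by a learned per-channel gate.
    y = F.conv2d(x, self.weight, self.bias, self.stride,
                 self.padding, self.dilation, self.groups)
    k = int(self.out_channels * self.rate)
    if k > 0:
        s = F.adaptive_avg_pool2d(torch.abs(x), (1, 1)).view(x.size(0), -1)
        g = F.relu(self.gate(s))
        i = (-g).topk(k, 1)[1]          # indices of the k smallest gate values
        t = g.scatter(1, i, 0)          # zero those channels
        t = t / torch.sum(t, dim=1).unsqueeze(1) * self.out_channels
        y = y * t.unsqueeze(2).unsqueeze(3)
    return y


def channel_pruning(m):
    if isinstance(m, nn.Conv2d):
        m.rate = RATE
        # Create the gate on the device of the module's own weight,
        # not on a global `device` variable.
        m.gate = nn.Linear(m.in_channels, m.out_channels, bias=True).to(m.weight.device)
        nn.init.constant_(m.gate.bias, 1)
        nn.init.kaiming_normal_(m.gate.weight)
        m.forward = types.MethodType(new_forward, m)


torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 4, 3, padding=1),
)
model.apply(channel_pruning)
out = model(torch.randn(2, 3, 16, 16))
```

Because the device is read from the module itself, the patching function no longer depends on where the global device happens to point when apply is called.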

I fixed this bug by replacing DataParallel with DistributedDataParallel.
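One possible explanation for why this switch helps, which the thread itself does not confirm: as far as I understand, DataParallel builds its per-GPU replicas as shallow copies of each module, and a forward assigned with types.MethodType is stored on the instance, so a copy would keep calling a method bound to the original module and its weights. DistributedDataParallel runs one process per GPU and does not make such copies. The sketch below only demonstrates the binding behaviour on the CPU, using copy.copy as a stand-in for replication; patched_forward is a hypothetical placeholder for the custom forward.

```python
import copy
import types

import torch.nn as nn
import torch.nn.functional as F


def patched_forward(self, x):
    # Hypothetical placeholder for the custom forward from the question.
    return F.conv2d(x, self.weight, self.bias, self.stride,
                    self.padding, self.dilation, self.groups)


conv = nn.Conv2d(3, 8, 3)
conv.forward = types.MethodType(patched_forward, conv)

# Stand-in for a replica: a shallow copy shares the instance attributes,
# including the bound method stored under "forward".
replica = copy.copy(conv)

# The copy's forward is still bound to the original module, so it would
# use the original module's weight regardless of where the copy lives.
bound_to_original = replica.forward.__self__ is conv
```

If this is indeed the cause, defining the custom behaviour in an nn.Conv2d subclass, rather than binding a method to each instance, would be another way to avoid the problem.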