In my layer class there is a value tensor, an index tensor, and a kernel tensor. In the forward function I use scatter_add to add the values into the kernel according to the indices, and the result is then used as the convolution kernel in a conv2d call. My layer class looks like this:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyLayer(nn.Module):
    def __init__(self, C_in, C_out):
        super(MyLayer, self).__init__()
        self.C_in = C_in
        self.C_out = C_out
        # learnable values that get scattered into the kernel
        self.value = nn.Parameter(...)
        # indices giving each value's position in the flattened kernel
        self.register_buffer('inds', ...)
        # flat kernel buffer, reshaped to a 4-D conv kernel in forward
        self.register_buffer('kernel', torch.zeros(self.C_in * self.C_out * 1 * 1))

    def forward(self, x, p):
        value = self.value * p
        # out-of-place scatter_add: self.kernel itself stays zero
        kernel = self.kernel.scatter_add(0, self.inds, value)
        kernel = kernel.view(self.C_in, self.C_out, 1, 1)
        out = F.conv2d(x, kernel, stride=1)
        return out
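In case it helps, here is a small self-contained sketch of what the forward pass computes, with the scatter_add step pulled out of the class; the sizes (16 channels, 32 values, a 56x56 input) and the scalar p are made-up placeholders, not my real configuration.

import torch
import torch.nn.functional as F

# Made-up sizes, only to illustrate how the flat kernel is assembled.
C_in, C_out = 16, 16
kernel_flat = torch.zeros(C_in * C_out)          # plays the role of the 'kernel' buffer
value = torch.randn(32)                          # plays the role of self.value
inds = torch.randint(0, C_in * C_out, (32,))     # plays the role of self.inds
p = 0.5                                          # scalar stand-in for the real p

# Out-of-place scatter_add: kernel_flat itself is left unchanged.
kernel = kernel_flat.scatter_add(0, inds, value * p)
kernel = kernel.view(C_in, C_out, 1, 1)

x = torch.randn(8, C_in, 56, 56)
out = F.conv2d(x, kernel, stride=1)
print(out.shape)   # torch.Size([8, 16, 56, 56]) since C_in == C_out here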
However, when I wrap my network with nn.DataParallel and train on 2 GPUs, the forward time doubles compared with a single GPU. Could someone tell me why my layer becomes even slower with multiple GPUs, and how I should modify it to work with nn.DataParallel?
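For reference, this is roughly how I measure the forward time in both setups; MyNet here is a placeholder for my actual network that contains MyLayer, and the batch size, input shape, and iteration count are arbitrary values chosen just for illustration.

import time
import torch
import torch.nn as nn

def time_forward(model, x, p, iters=50):
    # warm up so one-off costs (CUDA init, first replication) are not measured
    with torch.no_grad():
        for _ in range(5):
            model(x, p)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x, p)
        torch.cuda.synchronize()
    return (time.time() - start) / iters

# MyNet stands in for my real network that uses MyLayer internally.
net = MyNet().cuda()
x = torch.randn(64, 16, 56, 56, device='cuda')   # made-up input shape
p = 0.5                                          # scalar stand-in for the real p

t_single = time_forward(net, x, p)
t_dp = time_forward(nn.DataParallel(net, device_ids=[0, 1]), x, p)
print(f'single GPU: {t_single * 1e3:.1f} ms/iter, DataParallel: {t_dp * 1e3:.1f} ms/iter')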