My layer class holds a value tensor, an index tensor, and a kernel tensor. In the forward function I use scatter_add to add the values into the kernel at the given indices, and the resulting kernel is then used as the weight of a convolution. The layer looks like this:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyLayer(nn.Module):
    def __init__(self, C_in, C_out):
        super(MyLayer, self).__init__()
        self.C_in = C_in
        self.C_out = C_out
        # learnable non-zero entries of the kernel
        self.value = nn.Parameter(...)
        # flat positions in the kernel where each value is added
        self.register_buffer('inds', ...)
        # all-zero base tensor that scatter_add writes into
        self.register_buffer('kernel', torch.zeros(self.C_out * self.C_in * 1 * 1))

    def forward(self, x, p):
        value = self.value * p
        # out-of-place scatter_add: kernel[inds[i]] += value[i]
        kernel = self.kernel.scatter_add(0, self.inds, value)
        # conv2d expects weights of shape (C_out, C_in, kH, kW)
        kernel = kernel.view(self.C_out, self.C_in, 1, 1)
        out = F.conv2d(x, kernel, stride=1)
        return out
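To make the scatter_add step concrete, here is what it computes in isolation (the shapes and values below are made up; my real value and inds come from the elided initialization above):

    import torch

    # hypothetical sizes: a flattened 1x1 kernel with 6 entries
    base = torch.zeros(6)                  # plays the role of self.kernel
    inds = torch.tensor([0, 2, 2, 5])      # plays the role of self.inds
    vals = torch.tensor([1., 2., 3., 4.])  # plays the role of self.value * p

    out = base.scatter_add(0, inds, vals)
    print(out)  # tensor([1., 0., 5., 0., 0., 4.]) -- duplicate indices are summed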
However, when I wrap my network with nn.DataParallel and train on 2 GPUs, the forward pass takes roughly twice as long as on a single GPU. Why does my layer become slower with multiple GPUs, and how can I modify it to work with nn.DataParallel?
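For reference, this is roughly how I wrap and time the model. Net here stands in for my full network, which contains MyLayer and supplies p internally, and the input size is just a placeholder:

    import time
    import torch
    import torch.nn as nn

    model = Net().cuda()                     # Net: my full network containing MyLayer
    model = nn.DataParallel(model)           # replicate across the 2 visible GPUs

    x = torch.randn(64, 3, 224, 224).cuda() # placeholder input size

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        out = model(x)
    torch.cuda.synchronize()
    print('mean forward time:', (time.time() - start) / 100)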