Training becomes slower when using nn.DataParallel with custom convolution layer

In my layer class there are three tensors: a value tensor, an index tensor, and a kernel tensor. In the forward pass I use scatter_add to add the values into the kernel according to the indices, and then use the resulting kernel as the weight of a convolution. My layer class looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyLayer(nn.Module):
  def __init__(self, C_in, C_out):
    super(MyLayer, self).__init__()

    self.C_in = C_in
    self.C_out = C_out

    self.value = nn.Parameter(...)
    self.register_buffer('inds', ...)
    # flat zero buffer; filled by scatter_add in every forward pass
    self.register_buffer('kernel', torch.zeros(self.C_out * self.C_in * 1 * 1))

  def forward(self, x, p):
    value = self.value * p
    # out-of-place scatter: kernel[inds[i]] += value[i]
    kernel = self.kernel.scatter_add(0, self.inds, value)
    # F.conv2d expects a weight of shape (C_out, C_in, kH, kW)
    kernel = kernel.view(self.C_out, self.C_in, 1, 1)

    out = F.conv2d(x, kernel, stride=1)

    return out

However, when I wrap my network with nn.DataParallel and train on 2 GPUs, the forward pass takes roughly twice as long as on a single GPU. Could someone tell me why this layer becomes even slower with multiple GPUs, and how to modify it to work well with nn.DataParallel?

Nothing in the code you posted looks particularly slow on its own. Does the larger model contain many small layers/kernels? nn.DataParallel replicates the module to N devices on every forward call, scatters the input batch across them, and gathers the outputs. That per-iteration replication and communication overhead can dominate the runtime if the model is small or consists of many tiny kernels.
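To illustrate the mechanics (a minimal sketch; the Conv2d stand-in and sizes here are made up, not your actual layer):

```python
import torch
import torch.nn as nn

# A toy 1x1 conv standing in for the custom layer (hypothetical sizes).
layer = nn.Conv2d(16, 16, kernel_size=1)

# nn.DataParallel replicates `layer` onto every visible GPU on each
# forward call, scatters x along dim 0, and gathers the per-GPU outputs.
# On a CPU-only machine it simply falls through to the wrapped module.
parallel_layer = nn.DataParallel(layer)

x = torch.randn(8, 16, 32, 32)
out = parallel_layer(x)
print(out.shape)  # torch.Size([8, 16, 32, 32])
```

The replication and scatter/gather happen on every iteration, so their cost is paid regardless of how much compute each replica actually does.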

Thanks for your advice. Yes, there are many small kernels in my network, and in fact training is extremely slow when the batch size is small. It looks like with a small batch size the runtime of scatter_add becomes the bottleneck: it operates on the kernel, whose size is independent of the batch, so every replica still does the full scatter_add and nn.DataParallel cannot accelerate it. As the batch size increases, the convolution work dominates and nn.DataParallel begins to speed up training.
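A rough way to see this (a CPU-only sketch with made-up sizes; on GPU you would need torch.cuda.synchronize() around the timers): the scatter_add cost is fixed per call, while the conv cost grows with the batch, so the fixed part dominates at small batch sizes.

```python
import time
import torch
import torch.nn.functional as F

C_in, C_out = 64, 64
inds = torch.randint(0, C_in * C_out, (C_in * C_out,))
value = torch.randn(C_in * C_out)
kernel_buf = torch.zeros(C_in * C_out)

def forward(x):
    # scatter_add touches only the kernel: cost independent of batch size
    kernel = kernel_buf.scatter_add(0, inds, value)
    kernel = kernel.view(C_out, C_in, 1, 1)
    # conv cost scales with the batch size
    return F.conv2d(x, kernel, stride=1)

for batch in (1, 32):
    x = torch.randn(batch, C_in, 16, 16)
    t0 = time.perf_counter()
    for _ in range(100):
        forward(x)
    print(f"batch={batch}: {time.perf_counter() - t0:.3f}s")
```

At batch size 1 the total time is dominated by the per-call scatter_add; at batch size 32 the conv term takes over, which matches what you observed with nn.DataParallel.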