My layer class holds a value tensor, an index tensor, and a kernel tensor. In the forward function I use scatter_add to add the values into the kernel at the given indices, and the resulting kernel is then used as the weight of a convolution. The layer looks like this:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyLayer(nn.Module):
    def __init__(self, C_in, C_out):
        super(MyLayer, self).__init__()
        self.C_in = C_in
        self.C_out = C_out
        # learnable non-zero entries of the kernel
        self.value = nn.Parameter(...)
        # flat positions in the kernel where each value is added
        self.register_buffer('inds', ...)
        # all-zero base tensor that scatter_add writes into
        self.register_buffer('kernel', torch.zeros(self.C_out * self.C_in * 1 * 1))

    def forward(self, x, p):
        value = self.value * p
        # out-of-place scatter_add: kernel[inds[i]] += value[i]
        kernel = self.kernel.scatter_add(0, self.inds, value)
        # conv2d expects weights of shape (C_out, C_in, kH, kW)
        kernel = kernel.view(self.C_out, self.C_in, 1, 1)
        out = F.conv2d(x, kernel, stride=1)
        return out
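To make the scatter_add step concrete, here is what it computes in isolation (the shapes and values below are made up; my real value and inds come from the elided initialization above):

    import torch

    # hypothetical sizes: a flattened 1x1 kernel with 6 entries
    base = torch.zeros(6)                  # plays the role of self.kernel
    inds = torch.tensor([0, 2, 2, 5])      # plays the role of self.inds
    vals = torch.tensor([1., 2., 3., 4.])  # plays the role of self.value * p

    out = base.scatter_add(0, inds, vals)
    print(out)  # tensor([1., 0., 5., 0., 0., 4.]) -- duplicate indices are summed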
However, when I wrap my network with nn.DataParallel and train on 2 GPUs, the forward pass takes roughly twice as long as on a single GPU. Why does my layer become slower with multiple GPUs, and how can I modify it to work with nn.DataParallel?
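For reference, this is roughly how I wrap and time the model. Net here stands in for my full network, which contains MyLayer and supplies p internally, and the input size is just a placeholder:

    import time
    import torch
    import torch.nn as nn

    model = Net().cuda()                     # Net: my full network containing MyLayer
    model = nn.DataParallel(model)           # replicate across the 2 visible GPUs

    x = torch.randn(64, 3, 224, 224).cuda() # placeholder input size

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        out = model(x)
    torch.cuda.synchronize()
    print('mean forward time:', (time.time() - start) / 100)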