Second-order gradient causes segfault over time

Hello, I’m running a GAN model with some second-order regularization. Here’s the actual code for the grad calculation:

if pl_reg:
    pl_dlatents.requires_grad_(True)
    pl_fake = self.G(None, dlatents=pl_dlatents) # pass through the generator
    pl_noise = pl_fake.new(pl_fake.shape).normal_() / 1024 # random projection of the generator output
    pl_grads = autograd.grad(
        outputs=(pl_fake * pl_noise).sum(),
        inputs=pl_dlatents,
        grad_outputs=None,
        create_graph=True, # keep the graph so the penalty can itself be backpropagated (second order)
        retain_graph=True,
        only_inputs=True,
    )[0]
    pl_lengths = pl_grads.pow(2).sum(1).mul(1 / self.G.get_num_layers()).sqrt() # per-sample path length
    return pl_lengths
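
For context, the returned pl_lengths are then reduced to a scalar penalty that gets backpropagated, roughly like this (a sketch only; pl_weight is an illustrative name, not the exact training code):

# Rough sketch of the caller side (illustrative, not the exact code).
pl_loss = pl_lengths.mean() * pl_weight  # scalar penalty on the returned path lengths
pl_loss.backward()                       # backprops through pl_grads, i.e. the second-order part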

Training goes well at the beginning and the loss decreases. However, after a random amount of time, ranging from 3000 to 8000 updates, a segfault occurs and training is terminated.

I'm using the official Docker images for training. The error output differs between PyTorch 1.3 and 1.4.

In 1.3 it shows:

*** Error in `python3': double free or corruption (fasttop): 0x00007f9e280c3fe0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fa1793127e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fa17931b37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fa17931f53c]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x39d820e)[0x7fa13bb9420e]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x39d82b9)[0x7fa13bb942b9]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x39d8435)[0x7fa13bb94435]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(_ZN5torch8autograd6Engine17evaluate_functionERNS0_8NodeTaskE+0x1210)[0x7fa13bb8bb50]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(_ZN5torch8autograd6Engine11thread_mainEPNS0_9GraphTaskE+0x1c4)[0x7fa13bb8da04]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN5torch8autograd6python12PythonEngine11thread_initEi+0x2a)[0x7fa16a537eda]
/opt/conda/lib/python3.6/site-packages/torch/…/…/…/libstdc++.so.6(+0xc819d)[0x7fa169ffb19d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fa17966c6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fa1793a241d]

followed by a long memory map.

In 1.4 it shows:

free(): invalid pointer

with no further information.

I only apply this regularization every 8 updates, and every segfault so far has happened on an update count divisible by 8. That is why I suspect the problem comes from the second-order gradient.
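
To make the pattern concrete, here is a minimal, self-contained toy version of the "second-order penalty every 8 updates" schedule. It only illustrates what I mean by second-order grad; it does not reproduce the crash by itself, and all names are illustrative:

import torch
from torch import autograd, nn

# Toy network standing in for the real generator (illustrative only).
net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(64):
    x = torch.randn(8, 16)
    loss = net(x).pow(2).mean()  # ordinary first-order objective
    if step % 8 == 0:  # lazy regularization: every 8th update
        z = torch.randn(8, 16, requires_grad=True)
        out = net(z)
        noise = torch.randn_like(out) / 1024
        grads = autograd.grad((out * noise).sum(), z, create_graph=True)[0]
        loss = loss + grads.pow(2).sum(1).sqrt().mean()  # second-order penalty term
    opt.zero_grad()
    loss.backward()  # on every 8th step this also backprops through grads
    opt.step()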

I'm using 4 Titan X Pascal GPUs for data-parallel training. The code above runs inside a DataParallel module, the per-GPU results are gathered afterwards, and a loss is then computed to minimize pl_lengths. In the generator I use F.conv2d() with groups, where both the input and the weights are computed by other upstream networks, so it is different from an ordinary ConvNet:

class ModulatedConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, hidden_channels, kernel_size=3, stride=1, padding=1, dilation=1,
                 noisy=True, randomize_noise=True, up=False, demodulize=True, gain=1, lrmul=1):
        super(ModulatedConv2d, self).__init__()
        assert kernel_size >= 1 and kernel_size % 2 == 1
        self.noisy = noisy
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.randomize_noise = randomize_noise
        self.up = up
        self.demodulize = demodulize
        self.lrmul = lrmul

        # Get weight.
        fan_in = in_channels * kernel_size * kernel_size
        self.runtime_coef = gain / math.sqrt(fan_in) * math.sqrt(lrmul)
        self.weight = Parameter(torch.randn(out_channels, in_channels, kernel_size, kernel_size) / math.sqrt(lrmul), requires_grad=True) # [OIkk]

        # Get bias.
        self.bias = Parameter(torch.zeros(1, out_channels, 1, 1), requires_grad=True)

        # Modulate layer.
        self.mod = ScaleLinear(hidden_channels, in_channels, bias=True) # [BI] Transform incoming W to style.

        # Noise scale.
        if noisy:
            self.noise_scale = Parameter(torch.zeros(1), requires_grad=True)

    def forward(self, x, y, noise=None):
        w = self.weight * self.runtime_coef
        ww = w[np.newaxis] # [BOIkk] Introduce minibatch dimension.

        # Modulate.
        s = self.mod(y) + 1 # [BI] Add bias (initially 1).
        ww = ww * s[:, np.newaxis, :, np.newaxis, np.newaxis] # [BOIkk] Scale input feature maps.

        # Demodulate.
        if self.demodulize:
            d = torch.rsqrt(ww.pow(2).sum(dim=(2,3,4), keepdim=True) + 1e-8) # [BOIkk] Scaling factor.
            ww = ww * d # [BOIkk] Scale output feature maps.

        # Reshape/scale input.
        B = y.size(0)
        x = x.view(1, -1, *x.shape[2:]) # Fused [BIhw] => reshape minibatch to convolution groups [1(BI)hw].
        w = ww.view(-1, *ww.shape[2:]) # [(BO)Ikk]

        # Convolution with optional up/downsampling.
        if self.up: x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = F.conv2d(x, w, None, self.stride, self.padding, self.dilation, groups=B) # [1(BO)hw]

        # Reshape/scale output.
        x = x.view(B, -1, *x.shape[2:]) # [BOhw]

        # Apply noise and bias
        if self.noisy:
            if self.randomize_noise: noise = x.new_empty(B, 1, *x.shape[2:]).normal_()
            x += noise * self.noise_scale
        x += self.bias * self.lrmul
        return x
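
For completeness, this is roughly what it takes to run the module above on its own. The imports go before the class definition, ScaleLinear is replaced by a plain nn.Linear stand-in (the real ScaleLinear is a custom layer in my code), and the sizes are purely illustrative:

# These imports (and the ScaleLinear stand-in) need to be defined before the class above.
import math
import numpy as np
import torch
from torch import nn
from torch.nn import Parameter
import torch.nn.functional as F

ScaleLinear = nn.Linear  # stand-in only, not the real layer

# Standalone shape check with illustrative sizes.
conv = ModulatedConv2d(in_channels=64, out_channels=64, hidden_channels=512)
x = torch.randn(4, 64, 32, 32)  # [BIhw] input feature maps
y = torch.randn(4, 512)         # [B, hidden] per-sample style input
out = conv(x, y)
print(out.shape)                # torch.Size([4, 64, 32, 32])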

Sorry that I was not able to come up with a smaller sample that reproduces the error. The crash happens in every run, but at a random point, and it takes a long time to show up. Let me know if you need any additional information.

Thank you very much for your time!

Thanks. Can you file a bug on the GitHub issue tracker about this? If you can include the whole reproducer in executable form, that would be quite helpful.