Hello, I’m running a GAN model with some second-order regularization. Here’s the actual code for the grad calculation:
if pl_reg:
    pl_dlatents.requires_grad_(True)
    pl_fake = self.G(None, dlatents=pl_dlatents)  # pass through the generator
    pl_noise = pl_fake.new(pl_fake.shape).normal_() / 1024
    pl_grads = autograd.grad(
        outputs=(pl_fake * pl_noise).sum(),
        inputs=pl_dlatents,
        grad_outputs=None,
        create_graph=True,
        retain_graph=True,
        only_inputs=True,
    )[0]
    pl_lengths = pl_grads.pow(2).sum(1).mul(1/self.G.get_num_layers()).sqrt()
    return pl_lengths
Training went well at the beginning and the loss decreased. However, after a random number of updates, somewhere between 3000 and 8000, a segfault occurs and training terminates.
I’m using the official Docker images for training. The error is different between PyTorch 1.3 and 1.4.
In 1.3 it shows:
*** Error in `python3': double free or corruption (fasttop): 0x00007f9e280c3fe0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fa1793127e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fa17931b37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fa17931f53c]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x39d820e)[0x7fa13bb9420e]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x39d82b9)[0x7fa13bb942b9]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x39d8435)[0x7fa13bb94435]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(_ZN5torch8autograd6Engine17evaluate_functionERNS0_8NodeTaskE+0x1210)[0x7fa13bb8bb50]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so(_ZN5torch8autograd6Engine11thread_mainEPNS0_9GraphTaskE+0x1c4)[0x7fa13bb8da04]
/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN5torch8autograd6python12PythonEngine11thread_initEi+0x2a)[0x7fa16a537eda]
/opt/conda/lib/python3.6/site-packages/torch/…/…/…/libstdc++.so.6(+0xc819d)[0x7fa169ffb19d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fa17966c6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fa1793a241d]
followed by a long memory map.
In 1.4 it shows:
free(): invalid pointer
with no other information.
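Since the 1.4 message carries no backtrace, one way to at least get a Python-level traceback for each thread at the moment of the crash would be the standard-library `faulthandler` module, which installs handlers for SIGSEGV and SIGABRT among others. This is only a minimal sketch, assuming it runs at the top of the training script; `log_path` is a hypothetical location, and the dump shows Python frames only, not the C++ frames inside libtorch.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location (an assumption); any writable path would do.
log_path = os.path.join(tempfile.gettempdir(), "train_fault.log")

# The file must stay open for the lifetime of the process,
# because faulthandler writes to its file descriptor on a fatal signal.
fault_log = open(log_path, "w")

# Dump the Python traceback of every thread on SIGSEGV, SIGFPE,
# SIGABRT, SIGBUS, or SIGILL.
faulthandler.enable(file=fault_log, all_threads=True)
```

The same effect is available without code changes by starting the interpreter with `python3 -X faulthandler`, which writes the dump to stderr.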
I only apply the regularization every 8 updates, and every crash happened when the number of updates was divisible by 8. That is why I suspect the problem comes from the second-order gradient.
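The schedule described above can be sketched as follows. This is an illustration only, not the actual training loop: the helper names and the `step % interval == 0` convention are assumptions.

```python
def is_reg_step(step, interval=8):
    """True when the regularizer would run on this update (assumed convention)."""
    return step % interval == 0


def reg_steps(first, last, interval=8):
    """All updates in [first, last] on which the regularizer would run."""
    return [s for s in range(first, last + 1) if is_reg_step(s, interval)]


def crashes_align_with_reg(crash_steps, interval=8):
    """True if every recorded crash step falls on a regularization update."""
    return all(is_reg_step(s, interval) for s in crash_steps)


schedule = reg_steps(1, 24)  # regularization updates among the first 24 steps
```

Passing the update counts recorded at each crash to `crashes_align_with_reg` is a quick way to check whether the failures really coincide with the regularization updates.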
I’m using 4 Titan X (Pascal) GPUs for data-parallel training. The code above runs inside a DataParallel module; the results are gathered afterward, followed by a loss that minimizes pl_lengths. In the generator design, I use F.conv2d() with groups, and both the input and the weights are computed by upstream networks. That’s different from an ordinary ConvNet:
class ModulatedConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, hidden_channels, kernel_size=3, stride=1, padding=1, dilation=1,
                 noisy=True, randomize_noise=True, up=False, demodulize=True, gain=1, lrmul=1):
        super(ModulatedConv2d, self).__init__()
        assert kernel_size >= 1 and kernel_size % 2 == 1
        self.noisy = noisy
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.randomize_noise = randomize_noise
        self.up = up
        self.demodulize = demodulize
        self.lrmul = lrmul

        # Get weight.
        fan_in = in_channels * kernel_size * kernel_size
        self.runtime_coef = gain / math.sqrt(fan_in) * math.sqrt(lrmul)
        self.weight = Parameter(torch.randn(out_channels, in_channels, kernel_size, kernel_size) / math.sqrt(lrmul), requires_grad=True)  # [OIkk]

        # Get bias.
        self.bias = Parameter(torch.zeros(1, out_channels, 1, 1), requires_grad=True)

        # Modulate layer.
        self.mod = ScaleLinear(hidden_channels, in_channels, bias=True)  # [BI] Transform incoming W to style.

        # Noise scale.
        if noisy:
            self.noise_scale = Parameter(torch.zeros(1), requires_grad=True)

    def forward(self, x, y, noise=None):
        w = self.weight * self.runtime_coef
        ww = w[np.newaxis]  # [BOIkk] Introduce minibatch dimension.

        # Modulate.
        s = self.mod(y) + 1  # [BI] Add bias (initially 1).
        ww = ww * s[:, np.newaxis, :, np.newaxis, np.newaxis]  # [BOIkk] Scale input feature maps.

        # Demodulate.
        if self.demodulize:
            d = torch.rsqrt(ww.pow(2).sum(dim=(2,3,4), keepdim=True) + 1e-8)  # [BOIkk] Scaling factor.
            ww = ww * d  # [BOIkk] Scale output feature maps.

        # Reshape/scale input.
        B = y.size(0)
        x = x.view(1, -1, *x.shape[2:])  # Fused [BIhw] => reshape minibatch to convolution groups [1(BI)hw].
        w = ww.view(-1, *ww.shape[2:])  # [(BO)Ikk]

        # Convolution with optional up/downsampling.
        if self.up:
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = F.conv2d(x, w, None, self.stride, self.padding, self.dilation, groups=B)  # [1(BO)hw]

        # Reshape/scale output.
        x = x.view(B, -1, *x.shape[2:])  # [BOhw]

        # Apply noise and bias
        if self.noisy:
            if self.randomize_noise:
                noise = x.new_empty(B, 1, *x.shape[2:]).normal_()
            x += noise * self.noise_scale
        x += self.bias * self.lrmul
        return x
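To make the batch-as-groups reshaping in `forward` easier to follow, here is the shape bookkeeping as a small pure-Python sketch. The function `fused_shapes` is a hypothetical helper, not part of my model; it assumes `up=False` (no interpolation) and uses the standard convolution output-size formula.

```python
def conv_out_size(size, k, stride=1, padding=1, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (k - 1) - 1) // stride + 1


def fused_shapes(B, I, O, k, h, w, stride=1, padding=1, dilation=1):
    """Tensor shapes at each stage of the grouped convolution with groups=B."""
    h_out = conv_out_size(h, k, stride, padding, dilation)
    w_out = conv_out_size(w, k, stride, padding, dilation)
    return {
        "x_in": (B, I, h, w),                  # [BIhw]
        "x_fused": (1, B * I, h, w),           # [1(BI)hw], one group per sample
        "w_fused": (B * O, I, k, k),           # [(BO)Ikk], per-sample weights
        "y_fused": (1, B * O, h_out, w_out),   # [1(BO)hw]
        "y_out": (B, O, h_out, w_out),         # [BOhw]
    }


# Example sizes chosen only for illustration.
shapes = fused_shapes(B=4, I=8, O=16, k=3, h=32, w=32)
```

With `groups=B`, each group sees `I` input channels and produces `O` output channels, so every sample in the minibatch is convolved with its own modulated weights in a single `F.conv2d` call.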
Sorry that I could not find a smaller code sample that reproduces the error. The error happens on every run, but at a random point, and it takes a long time to appear. Let me know if you need any additional information.
Thank you very much for your time!