Segmentation Fault bias initialisation Conv2d

Hi!
I am facing a problem using PyTorch 1.3.0 on a CUDA V100. Here is the code, originating from


and the associated paper https://arxiv.org/pdf/1608.03981.pdf

class DnCNN(nn.Module):
    def __init__(self, depth=17, n_channels=64, image_channels=1, use_bnorm=True, kernel_size=3):
        super(DnCNN, self).__init__()
        kernel_size = 3
        padding = 1
        layers = []

        layers.append(nn.Conv2d(in_channels=image_channels, out_channels=n_channels, kernel_size=kernel_size, padding=padding, bias=True))
        layers.append(nn.ReLU(inplace=True))
        for _ in range(depth-2):
            layers.append(nn.Conv2d(in_channels=n_channels, out_channels=n_channels, kernel_size=kernel_size, padding=padding, bias=False))
            layers.append(nn.BatchNorm2d(n_channels, eps=0.0001, momentum=0.95))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.Conv2d(in_channels=n_channels, out_channels=image_channels, kernel_size=kernel_size, padding=padding, bias=False))
        self.dncnn = nn.Sequential(*layers)
        self._initialize_weights()
        print("DnCNN init done")

    def forward(self, x):
        y = x
        out = self.dncnn(x)
        return y - out

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                print('Conv init weight...', m.weight.size())
                init.orthogonal_(m.weight)
                print('Ok')
                if m.bias is not None:
                    print('Conv init bias...')
                    init.constant_(m.bias, 0)
                    print('Ok')
            elif isinstance(m, nn.BatchNorm2d):
                print('BN init weight...')
                init.constant_(m.weight, 1)
                print('BN init bias...')
                init.constant_(m.bias, 0)
                print('Ok')

Use device…: cuda
Conv init weight… torch.Size([64, 1, 3, 3])
Ok
m.bias= Parameter containing:
tensor([-0.2753, 0.0797, -0.0850, -0.0789, -0.1403, 0.2473, 0.2015, -0.2147,
-0.1405, -0.1591, -0.0177, 0.2169, 0.3185, -0.2955, -0.3116, 0.1439,
0.2683, 0.2349, 0.2002, -0.0572, 0.2871, 0.1560, -0.2910, 0.1999,
0.2363, -0.0208, -0.0093, -0.2994, 0.1569, -0.0401, -0.3037, -0.2558,
-0.3046, -0.2971, 0.1851, 0.1453, -0.1999, 0.1158, 0.2158, -0.2221,
0.0930, 0.3183, -0.1261, -0.0886, -0.1297, 0.0019, 0.0564, -0.0134,
0.1727, 0.0585, 0.1753, -0.2736, 0.0683, 0.1069, -0.0181, 0.0422,
-0.2124, -0.1882, -0.1084, 0.2899, 0.1648, 0.1981, -0.0342, -0.1585],
requires_grad=True)
Conv init bias…
Ok
Conv init weight… torch.Size([64, 64, 3, 3])
Segmentation fault

So, the segmentation fault is raised by the second Conv2d layer. Any idea?

In general, you should never see segfaults :slight_smile:

Can you check whether it still happens with the nightly builds?
It would be nice to have a small code sample that reproduces the issue. Can you try removing the data loading, special logging / data preprocessing, layers from your net etc. until the error disappears? It would really help to have a small code sample to reproduce this locally. Thanks!

Great! @albanD, here is the minimal code, very simple: it defines the DnCNN network (the one in https://github.com/cszn/DnCNN/blob/master/TrainingCodes/dncnn_pytorch/main_train.py) and then instantiates it. The code is below (I could not get it to display nicely).

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as init
    
class DnCNN(nn.Module):
    def __init__(self, depth=17, n_channels=64, image_channels=1, use_bnorm=True, kernel_size=3):
        super(DnCNN, self).__init__()
        kernel_size = 3
        padding = 1
        layers = []

        layers.append(nn.Conv2d(in_channels=image_channels, out_channels=n_channels, kernel_size=kernel_size, padding=padding, bias=True))
        layers.append(nn.ReLU(inplace=True))
        for _ in range(depth-2):
            layers.append(nn.Conv2d(in_channels=n_channels, out_channels=n_channels, kernel_size=kernel_size, padding=padding, bias=False))
            layers.append(nn.BatchNorm2d(n_channels, eps=0.0001, momentum = 0.95))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.Conv2d(in_channels=n_channels, out_channels=image_channels, kernel_size=kernel_size, padding=padding, bias=False))
        self.dncnn = nn.Sequential(*layers)
        self._initialize_weights()
        print("DnCNN init done")

    def forward(self, x):
        y = x
        out = self.dncnn(x)
        return y-out

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                print('Conv init weight...',m.weight.size())
                init.orthogonal_(m.weight)
                print('Ok')
                print('m.bias=', m.bias)
                if m.bias is not None:
                    print('Conv init bias...')
                    init.constant_(m.bias, 0)
                    print('Ok')
            elif isinstance(m, nn.BatchNorm2d):
                print('BN init weight...')
                init.constant_(m.weight, 1)
                print('BN init bias...')
                init.constant_(m.bias, 0)
                print('Ok')

model = DnCNN()

I just tried and this runs fine on my computer.

Where exactly does it fail for you?
How did you install PyTorch?

Hi!
I installed PyTorch using the pip that comes with anaconda3; my Python is 3.6.5. The machine runs CentOS 7.7.1908 on an x86_64 architecture.

Now, where it crashes exactly (looking at the log in my post above) is at the second Conv2d initialisation, i.e. the first one passes the weight and bias init. Please note that the first Conv2d has bias=True while the others have bias=False, and before initialising the bias there is a test: if m.bias is not None:
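
To illustrate that guard (a small sketch, assuming only the layer definitions above): a conv built with bias=True carries a bias Parameter, while one built with bias=False has m.bias set to None, so init.constant_ must be skipped for it.

import torch.nn as nn

# The first conv is created with bias=True, the inner ones with bias=False,
# hence the `if m.bias is not None:` guard before init.constant_.
conv_with_bias = nn.Conv2d(1, 64, kernel_size=3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
print(conv_with_bias.bias is None)  # False: bias is a Parameter of shape [64]
print(conv_no_bias.bias is None)    # True: nothing to initialise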

A complement:

  1. If I set depth=2, the loop for _ in range(depth-2): is not executed: no segfault.
  2. If I set depth=3, the loop is executed once; if I comment out the line layers.append(nn.Conv2d(...)), the init goes fine through the BN init: no segfault.
  3. If I instead keep the Conv2d layer but comment out the creation of the BN and ReLU layers, it segfaults.

So it is really the line layers.append(nn.Conv2d(in_channels=n_channels, out_channels=n_channels, kernel_size=kernel_size, padding=padding, bias=False)) which causes the trouble. Notice that n_channels=64, kernel_size=3 and padding=1.

So, the minimal code that leads to the segfault is

class DnCNN(nn.Module):
    def __init__(self, depth=17, n_channels=64, image_channels=1, use_bnorm=True, kernel_size=3): #depth=17 by default
        super(DnCNN, self).__init__()
        kernel_size = 3
        padding = 1
        layers = []

        self.dncnn = nn.Sequential(nn.Conv2d(in_channels=image_channels,
                                             out_channels=n_channels, kernel_size=kernel_size, padding=padding, bias=True),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(in_channels=n_channels,
                                             out_channels=n_channels, kernel_size=kernel_size, padding=padding, bias=False),
                                   nn.BatchNorm2d(n_channels, eps=0.0001, momentum = 0.95),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(in_channels=n_channels,
                                             out_channels=image_channels, kernel_size=kernel_size, padding=padding, bias=False)
                                   )

        self._initialize_weights()
        print("DnCNN init done")

    def forward(self, x):
        y = x
        out = self.dncnn(x)
        return y-out

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                print('Conv init weight...',m.weight.size())
                init.orthogonal_(m.weight)
                print('Ok')
                print('m.bias=', m.bias)
                if m.bias is not None:
                    print('Conv init bias...')
                    init.constant_(m.bias, 0)
                    print('Ok')
            elif isinstance(m, nn.BatchNorm2d):
                print('BN init weight...')
                init.constant_(m.weight, 1)
                print('BN init bias...')
                init.constant_(m.bias, 0)
                print('Ok')

####
model = DnCNN()
####
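
If it helps to cut this down even further, here is a sketch that isolates just the orthogonal init of that 64-to-64 conv (assuming only the shapes above: 64 channels, 3x3 kernel). The log stops right after printing the [64, 64, 3, 3] weight size, so if the problem really is in that init call, this alone should segfault too.

import torch.nn as nn
import torch.nn.init as init

# A single 64 -> 64, 3x3 conv and the same orthogonal init the model applies.
conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1, bias=False)
print('Conv init weight...', conv.weight.size())  # torch.Size([64, 64, 3, 3])
init.orthogonal_(conv.weight)
print('Ok')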

Have you tried the conda package, since you use anaconda?
Do you know how to use gdb to get a C++ stack trace? As I cannot reproduce it, I cannot get one myself :confused:

Which package do you mean, torch? I only know how to run python <myscript.py>, which imports the packages
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as init

Now, I use gdb from time to time (backtrace) for my C++ applications, but not for Python code…

Which package do you mean, torch?

You can check the getting started section and choose conda instead of pip.

Now, I use gdb from time to time (backtrace) for my C++ applications, but not for Python code…

You can do the exact same thing when running Python: first run gdb python, then inside gdb type r your_script.py. Please post the stack trace here!

Here it is:

(gdb) r segfault.py
Starting program: <not-shown>/anaconda3/bin/python segfault.py
warning: Unable to open "librpm.so.3" <not-shown>/anaconda3/lib/liblzma.so.5: version `XZ_5.1.2alpha' not found (required by /lib64/librpmio.so.3)), missing debuginfos notifications will not be displayed
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5c/c1a53b747a7e4d21198723c2b633e54f3c06d9.debug
Missing separate debuginfo for /lib64/libpthread.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/8b/33f7f8c86f8d544c63c5541a8e42b3ddfef8b1.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libc.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/39/8944d32cf16a67af51067a326e6c0cc14f90ed.debug
Missing separate debuginfo for /lib64/libdl.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/18/113e6e83d8e981b8e8d808f7f3dbb23f950a1d.debug
Missing separate debuginfo for /lib64/libutil.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e0/d39e293dc99997e7b4c9b6203301e6cd904b50.debug
Missing separate debuginfo for /lib64/librt.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/47/49697bf078337576c4629f0d30b296a0939779.debug
Missing separate debuginfo for /lib64/libm.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/56/81c054fdabcf789f4dda66e94f1f6ed1747327.debug
Missing separate debuginfo for <not shown>/anaconda3/lib/python3.6/site-packages/numpy/../../../libiomp5.so
Detaching after fork from child process 22253.
Missing separate debuginfo for <not shown>/anaconda3/lib/python3.6/site-packages/torch/lib/libgomp-7c85b1e2.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/0c/0cf5452435739ebddf30a881ede53634718e67.debug
Conv init weight... torch.Size([64, 1, 3, 3])
[New Thread 0x7fff985eb780 (LWP 22264)]
[New Thread 0x7fff93ffe800 (LWP 22265)]
[New Thread 0x7fff93bfc880 (LWP 22266)]
[New Thread 0x7fff937fa900 (LWP 22267)]
[New Thread 0x7fff933f8980 (LWP 22268)]
[New Thread 0x7fff92ff6a00 (LWP 22269)]
[New Thread 0x7fff92bf4a80 (LWP 22270)]
[New Thread 0x7fff927f2b00 (LWP 22271)]
[New Thread 0x7fff923f0b80 (LWP 22272)]
[New Thread 0x7fff91feec00 (LWP 22273)]
[New Thread 0x7fff91becc80 (LWP 22274)]
[New Thread 0x7fff917ead00 (LWP 22275)]
[New Thread 0x7fff913e8d80 (LWP 22276)]
[New Thread 0x7fff90fe6e00 (LWP 22277)]
[New Thread 0x7fff90be4e80 (LWP 22278)]
Ok
m.bias= Parameter containing:
tensor([ 0.2156, -0.2471, -0.2311,  0.2540,  0.3309, -0.2287, -0.0188,  0.0131,
        -0.2803, -0.1206,  0.0137,  0.0690,  0.2866, -0.0905,  0.1895, -0.2293,
        -0.2868, -0.0857,  0.0859, -0.0705, -0.2306,  0.1190,  0.2816, -0.2630,
        -0.1347,  0.1569,  0.0185,  0.1192,  0.1095, -0.1121,  0.2065, -0.1344,
         0.1836, -0.0868,  0.2133,  0.2493, -0.0732,  0.0236,  0.0594, -0.2189,
        -0.2067,  0.0991, -0.2000, -0.2467, -0.1080,  0.0670, -0.3130,  0.0110,
         0.1370,  0.3200, -0.0749,  0.2327,  0.0353, -0.0825,  0.1939,  0.0394,
        -0.1851, -0.1426,  0.3325,  0.1034,  0.2023, -0.3244,  0.0966, -0.2637],
       requires_grad=True)
Conv init bias...
Ok
Conv init weight... torch.Size([64, 64, 3, 3])

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000008 in ?? ()

Can you type bt to get the backtrace after it receives the segfault, please?

Here it is …

#0  0x0000000000000008 in ?? ()
#1  0x00007fffeeed1ad9 in gemm_omp_driver_v2 ()
   from <>/anaconda3/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_thread.so
#2  0x00007fffeeed0ae2 in mkl_blas_sgemm ()
   from <>/anaconda3/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_thread.so
#3  0x00007fffeb49de1f in mkl_lapack_slarfb ()
   from <>/anaconda3/lib/python3.6/site-packages/mkl_fft/../../../libmkl_core.so
#4  0x00007fffef883398 in mkl_lapack_sorgqr ()
   from <>/anaconda3/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_thread.so
#5  0x00007ffff109c808 in sorgqr_ ()
   from <>/anaconda3/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_lp64.so
#6  0x00007fff9fcb0bc2 in at::native::_qr_helper_cpu(at::Tensor const&, bool) ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#7  0x00007fffa012c9e5 in at::CPUType::(anonymous namespace)::_qr_helper(at::Tensor const&, bool) ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#8  0x00007fffa0154520 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, bool), std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, bool> >, std::tuple<at::Tensor, at::Tensor> (at::Tensor const&, bool)>::call(c10::OperatorKernel*, at::Tensor const&, bool) () from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#9  0x00007fff9fca014a in at::_qr_helper(at::Tensor const&, bool) ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#10 0x00007fff9fca04b7 in at::native::qr(at::Tensor const&, bool) ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
---Type <return> to continue, or q <return> to quit---
#11 0x00007fffa0286f85 in at::TypeDefault::qr(at::Tensor const&, bool) ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#12 0x00007fffa0154520 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, bool), std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, bool> >, std::tuple<at::Tensor, at::Tensor> (at::Tensor const&, bool)>::call(c10::OperatorKernel*, at::Tensor const&, bool) () from /sps/baoradio/JEC/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#13 0x00007fffa1e8b892 in torch::autograd::VariableType::(anonymous namespace)::qr(at::Tensor const&, bool) ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#14 0x00007fffa0154520 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, bool), std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, bool> >, std::tuple<at::Tensor, at::Tensor> (at::Tensor const&, bool)>::call(c10::OperatorKernel*, at::Tensor const&, bool) () from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so
#15 0x00007fffe9755d10 in std::result_of<std::tuple<at::Tensor, at::Tensor> c10::impl::OperatorEntry::callUnboxedOnly<std::tuple<at::Tensor, at::Tensor>, at::Tensor const&, bool>(c10::TensorTypeId, at::Tensor const&, bool) const::{lambda(c10::DispatchTable const&)#1} (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<std::tuple<at::Tensor, at::Tensor> c10::impl::OperatorEntry::callUnboxedOnly<std::tuple<at::Tensor, at::Tensor>, at::Tensor const&, bool>(c10::TensorTypeId, at::Tensor const&, bool) const::{lambda(c10::DispatchTable const&)#1}>(std::tuple<at::Tensor, at::Tensor> c10::impl::OperatorEntry::callUnboxedOnly<std::tuple<at::Tensor, at::Tensor>, at::Tensor const&, bool>(c10::TensorTypeId, at::Tensor const&, bool) const::{lambda(c10::DispatchTable const&)#1}&&) const ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#16 0x00007fffe962016c in torch::autograd::THPVariable_qr ()
   from <>/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#17 0x0000555555662b94 in _PyCFunction_FastCallDict ()
---Type <return> to continue, or q <return> to quit---
#18 0x00005555556f267c in call_function ()
#19 0x0000555555714cba in _PyEval_EvalFrameDefault ()
#20 0x00005555556eba94 in _PyEval_EvalCodeWithName ()
#21 0x00005555556ec941 in fast_function ()
#22 0x00005555556f2755 in call_function ()
#23 0x0000555555714cba in _PyEval_EvalFrameDefault ()
#24 0x00005555556ec70b in fast_function ()
#25 0x00005555556f2755 in call_function ()
#26 0x0000555555714cba in _PyEval_EvalFrameDefault ()
#27 0x00005555556ebc26 in _PyEval_EvalCodeWithName ()
#28 0x00005555556ece1b in _PyFunction_FastCallDict ()
#29 0x0000555555662f5f in _PyObject_FastCallDict ()
#30 0x0000555555667a03 in _PyObject_Call_Prepend ()
#31 0x000055555566299e in PyObject_Call ()
#32 0x00005555556bf02b in slot_tp_init ()
#33 0x00005555556f29b7 in type_call ()
#34 0x0000555555662d7b in _PyObject_FastCallDict ()
#35 0x00005555556f27ce in call_function ()
#36 0x0000555555714cba in _PyEval_EvalFrameDefault ()
#37 0x00005555556ed459 in PyEval_EvalCodeEx ()
#38 0x00005555556ee1ec in PyEval_EvalCode ()
#39 0x00005555557689a4 in run_mod ()
#40 0x0000555555768da1 in PyRun_FileExFlags ()
---Type <return> to continue, or q <return> to quit---
#41 0x0000555555768fa4 in PyRun_SimpleFileExFlags ()
#42 0x000055555576ca9e in Py_Main ()
#43 0x00005555556344be in main ()

It looks like it happens inside the MKL library.
I would suggest getting a clean install of both mkl and pytorch, as this may be due to an old or incompatible library.
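
For what it's worth, orthogonal_ flattens the weight to a 2-D matrix and calls torch.qr on it, which matches the mkl_lapack_sorgqr / mkl_blas_sgemm frames in the backtrace. A standalone call along these lines should therefore hit the same MKL code path (a sketch; the 576 x 64 shape is simply the flattened and transposed 64x64x3x3 weight):

import torch

# orthogonal_ on a [64, 64, 3, 3] weight flattens it to 64 x 576, transposes
# it (rows < cols) and runs a QR decomposition on the result. If MKL's QR is
# the culprit, this call should segfault on its own.
a = torch.randn(576, 64)
q, r = torch.qr(a)
print(q.shape, r.shape)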

Well, I guess I can reinstall pytorch since it is my own install under anaconda, but as I am running at a computing centre, I am pretty sure that mkl is another matter.

The mkl used in the stack trace above is installed within anaconda. So it would be the one you installed (or that got automatically installed by another package).

Well, I uninstalled my torch 1.3.0 with pip, then used conda to install a fresh PyTorch version (1.3.1). It came with updates of other packages, mkl 2019.4-243 and other mkl components, as well as others (numpy, …), and also moved Python from 3.6.5 to 3.7.5 :frowning: but in the end my segfault.py script works!!! Incredible.
I think I can close the thread. Thank you very much. :slight_smile:

Interesting. So I guess the pip version linked against a wrong mkl version, hence causing the issue!
In general, I would advise using the conda install of pytorch if you are in a conda environment. That will make sure you do not run into such issues!
Happy this is fixed.
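
To double-check which build and MKL you end up with after a reinstall, something along these lines can help (a small sketch; torch.__config__.show() prints the compile-time configuration, including the MKL that was linked in):

import torch

print(torch.__version__)                  # e.g. 1.3.1 after the conda install
print(torch.backends.mkl.is_available())  # whether this build can use MKL
print(torch.__config__.show())            # compile-time config, including BLAS/LAPACK details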