Segmentation fault (core dumped) on backward() call

Hey there,

I’m getting a segmentation fault when I try to initialize the weights of a WideResNet. I iterate over the convolutional layers of my model and manually initialize the weights. The segmentation fault occurs when I reach layer 20, which is a Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False). When my script reaches this layer, GPU memory spikes from 1.6 GB to 5.6 GB. Layer 19 is a 320x640 Conv2d, so I don’t think this should be happening.
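For context, the conv_layers list used in the snippet further down is collected from the model roughly like this. This is a simplified sketch rather than my exact code, and model here is just a placeholder for the WideResNet instance:

import torch
import torch.nn as nn

# simplified sketch: gather every Conv2d module of the network into a list
# ("model" stands in for the WideResNet instance; not my exact code)
conv_layers = [m for m in model.modules() if isinstance(m, nn.Conv2d)]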

I ran the script under gdb with the following commands:

gdb python3 
r main.py
where

and the output is:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fffd4e21362 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555e59bf560)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:149
149 /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h: No such file or directory.
(gdb) where
#0  0x00007fffd4e21362 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555e59bf560)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:149
#1  0x00007fffd4ec788e in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x5555e59bfee8, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:666
#2  std::__shared_ptr<torch::autograd::Function, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x5555e59bfee0, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:914
#3  std::shared_ptr<torch::autograd::Function>::~shared_ptr (
    this=0x5555e59bfee0, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:93
#4  torch::autograd::Edge::~Edge (this=0x5555e59bfee0,
    __in_chrg=<optimized out>) at /pytorch/torch/csrc/autograd/edge.h:14
#5  std::_Destroy<torch::autograd::Edge> (__pointer=0x5555e59bfee0)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h:93
#6  std::_Destroy_aux<false>::__destroy<torch::autograd::Edge*> (
    __last=<optimized out>, __first=0x5555e59bfee0)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h:103
#7  std::_Destroy<torch::autograd::Edge*> (__last=<optimized out>,
    __first=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h:126
#8  std::_Destroy<torch::autograd::Edge*, torch::autograd::Edge> (
    __last=0x5555e59bff10, __first=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h:151
#9  std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> >::~vector (this=0x5555e59bfff8, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_vector.h:424
#10 torch::autograd::Function::~Function (this=0x5555e59bffd0,
    __in_chrg=<optimized out>)
    at /pytorch/torch/csrc/autograd/function.h:100
#11 0x00007fffd4e21365 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555e59bffc0)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:149
#12 0x00007fffd4ec788e in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x5555e59c0948, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:666
#13 std::__shared_ptr<torch::autograd::Function, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x5555e59c0940, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:914
#14 std::shared_ptr<torch::autograd::Function>::~shared_ptr (
    this=0x5555e59c0940, __in_chrg=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:93
#15 torch::autograd::Edge::~Edge (this=0x5555e59c0940,
    __in_chrg=<optimized out>) at /pytorch/torch/csrc/autograd/edge.h:14
#16 std::_Destroy<torch::autograd::Edge> (__pointer=0x5555e59c0940)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h:93
#17 std::_Destroy_aux<false>::__destroy<torch::autograd::Edge*> (
    __last=<optimized out>, __first=0x5555e59c0940)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h:103
#18 std::_Destroy<torch::autograd::Edge*> (__last=<optimized out>,
    __first=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h:126
#19 std::_Destroy<torch::autograd::Edge*, torch::autograd::Edge> (
    __last=0x5555e59c0970, __first=<optimized out>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_construct.h

The backtrace continues past frame #10000, so I don’t think pasting more of it would help… But if it might, I’ll edit it in here.
I get the segmentation fault when I run the following code on very large convolutional layers, e.g. 640x640.
edit: code to reproduce:

import torch

for j, k in enumerate(conv_layers):  # conv_layers is a list of all Conv2d layers in my model
    sum_dist = 0
    for target in k.weight:          # iterate over every pair of filters in this layer
        for _filter in k.weight:
            sub = target - _filter

            # L2 distance between the two filters (epsilon keeps sqrt away from zero)
            dist = torch.pow(sub + 1e-07, 2)
            dist = torch.sum(dist)
            dist = torch.sqrt(dist)

            sum_dist = sum_dist + dist

    filter_loss = sum_dist
    filter_loss *= gamma        # gamma is usually 0.1
    filter_loss = -filter_loss  # negate to maximize the pairwise distances

    filter_loss.backward()

    kernel_optimizer.step()
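For completeness, kernel_optimizer is just a standard optimizer over the conv weights, something along these lines (SGD and the learning rate here are placeholders, not necessarily what I use):

kernel_optimizer = torch.optim.SGD(
    [k.weight for k in conv_layers],  # placeholder: optimize only the conv weights
    lr=0.01,                          # placeholder learning rate
)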

I also tried different models such as ResNet-101 and ResNet-18. I get the same behaviour when reaching large layers, but this code works perfectly for smaller models such as AlexNet.
I’m guessing it’s something related to PyTorch having to deal with all of those pows and sqrts?
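For scale, here is a rough back-of-the-envelope count of the graph the double loop builds for the 640x640 layer. This is my own estimate, just arithmetic over the snippet above, not something I measured:

out_channels = 640
pairs = out_channels * out_channels   # 640 * 640 = 409,600 (target, _filter) pairs
ops_per_pair = 5                      # sub, pow, sum, sqrt, plus the running sum_dist add
print(pairs * ops_per_pair)           # ~2 million autograd nodes kept alive until backward()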

I’m using PyTorch 0.4.0 on Ubuntu 16.04 x64. I also tried another machine with the same PyTorch and OS versions, and the outcome is the same.

Any help is appreciated :slight_smile: