Segmentation fault (core dumped) when using CUDA

My model runs (slowly) on CPU, but it cannot run on the GPU.
When I use CUDA (10.0.130), I get Segmentation fault (core dumped).
So I tried gdb python, and I got:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f231cdd9cc0 in _IO_vfprintf_internal (s=s@entry=0x7ffd3aee5f00, format=<optimized out>, format@entry=0x7f2319b6e4f0 "expected %s (got %s)", ap=ap@entry=0x7ffd3aee64a8)
    at vfprintf.c:1632
1632	vfprintf.c: No such file or directory.
(gdb) where
#0  0x00007f231cdd9cc0 in _IO_vfprintf_internal (s=s@entry=0x7ffd3aee5f00, format=<optimized out>, format@entry=0x7f2319b6e4f0 "expected %s (got %s)", ap=ap@entry=0x7ffd3aee64a8)
    at vfprintf.c:1632
#1  0x00007f231ce01a49 in _IO_vsnprintf (string=0x7ffd3aee6070 "expected \235+\373\263U", maxlen=<optimized out>, format=0x7f2319b6e4f0 "expected %s (got %s)", args=0x7ffd3aee64a8)
    at vsnprintf.c:114
#2  0x00007f231963d54d in torch::formatMessage(char const*, __va_list_tag*) () from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/
#3  0x00007f231963db11 in torch::TypeError::TypeError(char const*, ...) () from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/
#4  0x00007f23198c3a71 in torch::utils::(anonymous namespace)::new_with_tensor(c10::TensorTypeId, c10::ScalarType, at::Tensor const&) ()
   from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/
#5  0x00007f23198c5e20 in torch::utils::legacy_tensor_ctor(c10::TensorTypeId, c10::ScalarType, _object*, _object*) ()
   from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/
#6  0x00007f2319897a40 in torch::tensors::Tensor_new(_typeobject*, _object*, _object*) () from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/
#7  0x000055b361d92239 in _PyObject_FastCallKeywords ()
#8  0x000055b361dee6b2 in _PyEval_EvalFrameDefault ()
#9  0x000055b361d2f059 in _PyEval_EvalCodeWithName ()
#10 0x000055b361d3033c in _PyFunction_FastCallDict ()
#11 0x000055b361d46a03 in _PyObject_Call_Prepend ()
#12 0x000055b361d3b8d2 in PyObject_Call ()
#13 0x000055b361deb1ab in _PyEval_EvalFrameDefault ()
#14 0x000055b361d2f059 in _PyEval_EvalCodeWithName ()
#15 0x000055b361d3033c in _PyFunction_FastCallDict ()
#16 0x000055b361d46a03 in _PyObject_Call_Prepend ()
#17 0x000055b361d89baa in slot_tp_call ()
#18 0x000055b361d9261b in _PyObject_FastCallKeywords ()
#19 0x000055b361deea79 in _PyEval_EvalFrameDefault ()
#20 0x000055b361d2f059 in _PyEval_EvalCodeWithName ()
#21 0x000055b361d2ff24 in PyEval_EvalCodeEx ()
#22 0x000055b361d2ff4c in PyEval_EvalCode ()
#23 0x000055b361e48a14 in run_mod ()
#24 0x000055b361e51f11 in PyRun_FileExFlags ()
#25 0x000055b361e52104 in PyRun_SimpleFileExFlags ()
#26 0x000055b361e53bbd in pymain_main.constprop ()
#27 0x000055b361e53e30 in _Py_UnixMain ()
#28 0x00007f231cdab830 in __libc_start_main (main=0x55b361d0fd20 <main>, argc=2, argv=0x7ffd3aee7828, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd3aee7818) at ../csu/libc-start.c:291
#29 0x000055b361df9052 in _start () at ../sysdeps/x86_64/elf/start.S:103
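
For reference, a backtrace like the one above can be captured non-interactively (a sketch; `train.py` is a placeholder for the actual failing script):

```shell
# Run the script under gdb in batch mode; on a crash, gdb prints the
# backtrace and exits. "train.py" is a placeholder for the real script.
gdb -batch -ex run -ex bt --args python train.py
```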

So what should I change in my code, or is it a PyTorch bug?


That looks bad indeed.
The segfault happens while PyTorch is trying to raise a TypeError when constructing a Tensor.
Do you have a small code sample that reproduces this behavior? I would be happy to take a closer look!


You are right. I tried to construct a LongTensor from something that was already a Tensor. :grinning:

Do you have a repro? Because whatever you do, you should never get a segfault :smiley:
I would like to fix that if possible.
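
The pattern in question can be sketched as follows (a minimal, hypothetical repro; on a fixed build this raises a clean TypeError rather than crashing):

```python
import torch

# Pass an existing float tensor to the legacy LongTensor constructor.
# The dtypes do not match, so PyTorch tries to raise
# "TypeError: expected ... (got ...)" -- the same error-formatting code
# path seen in the backtrace above.
y = torch.randn(4)       # already a Tensor (float32)
try:
    torch.LongTensor(y)  # legacy constructor with a mismatched dtype
except TypeError as e:
    print("TypeError:", e)
```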

This issue might be related to this one.

@prophet_zhan are you using an older PyTorch version, as this should have been fixed already?

I am using 1.3.1 on a GPU server (TitanV * 2), and I was trying to run a single-GPU module on multiple GPUs.

My model can run now. Thanks.
@albanD This is what I changed:

        decoder_in, s, w = decoder_initial(x.size(0))
        decoder_in = y[:, 0]
        decoder_in_1 = decoder_in.to('cuda:1')

        # 1.7. for each decoder timestep
        for j in range(y.size(1) - 1):  # for all sequences
            # decoder_in (Variable): [b]
            # encoded (Variable): [b x seq x hid]
            # input_out (np.array): [b x seq]
            # s (Variable): [b x hid]
            # state - create [out]
            if j == 0:
                h_out, c_out, h_add = midlstm(None, None, y=decoder_in, order=j, encoded=encoded)
                if torch.cuda.is_available():
                    h_add = h_add.to('cuda:1')

And in the midlstm layer:

    def forward(self, h_0, c_0, y, order, encoded):

        y[y >= self.vocab_size] = 1  # y is an ndarray
        b = encoded.size(0)  # batch size
        seq = encoded.size(1)  # input sequence length
        hidden_size = self.hidden_size

        # I deleted
        #   `y = torch.LongTensor(y)`
        # and it works
        att = self.Emb(y).unsqueeze(1)
        # In __init__: self.Emb = nn.Embedding(vocab_size, seq_length)
        inputs = torch.bmm(att, encoded).squeeze()
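
For reference, if the indices really do need converting to integer type, the idiomatic way is a dtype cast on the existing tensor rather than the legacy constructor (a small sketch with example values; not the actual model data):

```python
import torch

y = torch.tensor([0.0, 3.0, 7.0])  # indices stored as floats (example values)

# Instead of torch.LongTensor(y), cast the existing tensor:
y_long = y.long()  # equivalent: y.to(torch.long)
print(y_long.dtype)  # torch.int64, suitable for nn.Embedding
```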

Can you check that, with the latest version of PyTorch, you don’t see this anymore?

Sorry, my GPU server’s cudatoolkit version is 10.0.130. It does not support the latest version of PyTorch.

The binaries ship with their own CUDA version, so your local CUDA installation won’t be used. Or is your driver too old?
Your TitanV will work with CUDA 10.2.89.
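
To check which CUDA version the installed binaries actually ship with, and whether the driver can use it, a quick sketch (note `torch.version.cuda` is `None` on CPU-only builds):

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version bundled with the binaries
print(torch.cuda.is_available())  # False can also mean the NVIDIA driver is too old
```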