Segmentation fault (core dumped) when using CUDA

My model can run (slowly) on the CPU, but it cannot run on the GPU.
When I use CUDA (10.0.130), I get Segmentation fault (core dumped).
So I tried running it under gdb python, and I got:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f231cdd9cc0 in _IO_vfprintf_internal (s=s@entry=0x7ffd3aee5f00, format=<optimized out>, format@entry=0x7f2319b6e4f0 "expected %s (got %s)", ap=ap@entry=0x7ffd3aee64a8)
    at vfprintf.c:1632
1632	vfprintf.c: No such file or directory.
(gdb) where
#0  0x00007f231cdd9cc0 in _IO_vfprintf_internal (s=s@entry=0x7ffd3aee5f00, format=<optimized out>, format@entry=0x7f2319b6e4f0 "expected %s (got %s)", ap=ap@entry=0x7ffd3aee64a8)
    at vfprintf.c:1632
#1  0x00007f231ce01a49 in _IO_vsnprintf (string=0x7ffd3aee6070 "expected \235+\373\263U", maxlen=<optimized out>, format=0x7f2319b6e4f0 "expected %s (got %s)", args=0x7ffd3aee64a8)
    at vsnprintf.c:114
#2  0x00007f231963d54d in torch::formatMessage(char const*, __va_list_tag*) () from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#3  0x00007f231963db11 in torch::TypeError::TypeError(char const*, ...) () from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#4  0x00007f23198c3a71 in torch::utils::(anonymous namespace)::new_with_tensor(c10::TensorTypeId, c10::ScalarType, at::Tensor const&) ()
   from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007f23198c5e20 in torch::utils::legacy_tensor_ctor(c10::TensorTypeId, c10::ScalarType, _object*, _object*) ()
   from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6  0x00007f2319897a40 in torch::tensors::Tensor_new(_typeobject*, _object*, _object*) () from /root/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#7  0x000055b361d92239 in _PyObject_FastCallKeywords ()
#8  0x000055b361dee6b2 in _PyEval_EvalFrameDefault ()
#9  0x000055b361d2f059 in _PyEval_EvalCodeWithName ()
#10 0x000055b361d3033c in _PyFunction_FastCallDict ()
#11 0x000055b361d46a03 in _PyObject_Call_Prepend ()
#12 0x000055b361d3b8d2 in PyObject_Call ()
#13 0x000055b361deb1ab in _PyEval_EvalFrameDefault ()
#14 0x000055b361d2f059 in _PyEval_EvalCodeWithName ()
#15 0x000055b361d3033c in _PyFunction_FastCallDict ()
#16 0x000055b361d46a03 in _PyObject_Call_Prepend ()
#17 0x000055b361d89baa in slot_tp_call ()
#18 0x000055b361d9261b in _PyObject_FastCallKeywords ()
#19 0x000055b361deea79 in _PyEval_EvalFrameDefault ()
#20 0x000055b361d2f059 in _PyEval_EvalCodeWithName ()
#21 0x000055b361d2ff24 in PyEval_EvalCodeEx ()
#22 0x000055b361d2ff4c in PyEval_EvalCode ()
#23 0x000055b361e48a14 in run_mod ()
#24 0x000055b361e51f11 in PyRun_FileExFlags ()
#25 0x000055b361e52104 in PyRun_SimpleFileExFlags ()
#26 0x000055b361e53bbd in pymain_main.constprop ()
#27 0x000055b361e53e30 in _Py_UnixMain ()
#28 0x00007f231cdab830 in __libc_start_main (main=0x55b361d0fd20 <main>, argc=2, argv=0x7ffd3aee7828, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd3aee7818) at ../csu/libc-start.c:291
#29 0x000055b361df9052 in _start () at ../sysdeps/x86_64/elf/start.S:103

So what should I change in my code, or is it a PyTorch bug?

Hi,

That looks bad indeed.
The segfault happens while PyTorch is trying to raise a TypeError when constructing a Tensor.
Do you have a small code sample that reproduces this behavior? I would be happy to take a closer look!

Thanks,

You are right. I was trying to construct a LongTensor from something that was already a Tensor. :grinning:
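
For reference, roughly the pattern that hits this code path looks like the sketch below (a reconstruction from memory rather than my exact code; it assumes a tensor that already lives on the GPU being passed to the legacy torch.LongTensor(...) constructor):

    import torch

    # A tensor that already lives on the GPU
    y = torch.zeros(4, dtype=torch.long, device='cuda')

    # Passing an existing CUDA tensor to the legacy CPU constructor is invalid;
    # PyTorch tries to raise a TypeError ("expected ... (got ...)") here, and on
    # the affected version the error-message formatting itself segfaulted.
    y = torch.LongTensor(y)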

Do you have a repro? Because whatever you do, you should never get a segfault :smiley:
I would like to fix that if possible.

This issue might be related to this one.

@prophet_zhan are you using an older PyTorch version, as this should have been fixed already?

I am using 1.3.1 on a GPU server (TitanV × 2), and I am trying to run a single-GPU model on multiple GPUs.

My model runs now. Thanks.
@albanD This is what I changed:

        decoder_in, s, w = decoder_initial(x.size(0))
        decoder_in = y[:, 0]
        decoder_in_1 = decoder_in.to('cuda:1')

        # 1.7. for each decoder timestep
        for j in range(y.size(1) - 1):  # for all sequences
            """
            decoder_in (Variable): [b]
            encoded (Variable): [b x seq x hid]
            input_out (np.array): [b x seq]
            s (Variable): [b x hid]
            """
            # 1.7.1. 1st state - create [out]
            if j == 0:
                h_out, c_out, h_add = midlstm(None, None, y=decoder_in, order=j, encoded=encoded)
                if torch.cuda.is_available():
                    h_add = h_add.to('cuda:1')

And in the midlstm layer:

    def forward(self, h_0, c_0, y, order, encoded):

        y[y >= self.vocab_size] = 1  # y is an ndarray
        b = encoded.size(0)  # batch size
        seq = encoded.size(1)  # input sequence length
        hidden_size = self.hidden_size

        # I deleted
        #   `y = torch.LongTensor(y)`
        # and it works
        att = self.Emb(y).unsqueeze(1)
        # In __init__: self.Emb = nn.Embedding(vocab_size, seq_length)
        inputs = torch.bmm(att, encoded).squeeze()
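
If a dtype conversion is still needed at that point, the non-legacy conversion APIs avoid the legacy constructor path entirely; a minimal sketch (variable names taken from the snippet above, the exact placement is just an assumption):

    # Instead of the legacy `torch.LongTensor(y)` constructor:
    y = y.long()  # no-op if y is already an int64 tensor

    # Or, if y might also arrive as a NumPy array or a Python list:
    # y = torch.as_tensor(y, dtype=torch.long, device=encoded.device)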

Can you check that, with the latest version of PyTorch, you don't see this anymore?

Sorry, my GPU server's cudatoolkit version is 10.0.130. It does not support the latest version of PyTorch.

The binaries ship with their own CUDA runtime, so your local CUDA installation won't be used. Or is your driver too old?
Your TitanV will work with CUDA 10.2.89.
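
As a quick sanity check of which CUDA runtime the installed binaries actually ship with (and that only the NVIDIA driver on the machine matters), something like this can be run; it only uses standard torch introspection calls:

    import torch

    print(torch.__version__)           # installed PyTorch version
    print(torch.version.cuda)          # CUDA runtime bundled with this binary
    print(torch.cuda.is_available())   # True as long as the driver is recent enough
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # should report the TITAN V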