C++ example code causes seg fault on Google Colab

sakaia · April 24, 2019, 9:03am

I am seeing strange behaviour on Google Colab: the C++ MNIST example crashes with a segmentation fault, but only when it runs with CUDA/cuDNN, and I do not know why. Any suggestions?
Here is how I fetch, compile, and run the mnist example:

!git clone http://github.com/pytorch/examples
!cd examples/cpp/mnist;cmake -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.6/dist-packages/torch/lib/ -DTorch_DIR=/usr/local/lib/python3.6/dist-packages/torch/share/cmake/Torch/ ; make
!cd examples/cpp/mnist; ./mnist

With CUDA, the run ends with the following error:

CUDA available! Training on GPU.

snip

Train Epoch: 10 [59584/60000] Loss: 0.0707
Test set: Average loss: 0.0533 | Accuracy: 0.983
/bin/bash: line 1:  1084 Segmentation fault      (core dumped) ./mnist
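
The crash is also visible in the exit status: a shell reports a process killed by a signal as 128 plus the signal number, and SIGSEGV is signal 11 on Linux. Below is a minimal sketch for checking this. `crash_demo` and `describe_exit` are hypothetical names made up for illustration, and `crash_demo` is only a stand-in for `./mnist` so the snippet runs without a GPU.

```shell
# Hypothetical stand-in for ./mnist: a child shell that kills itself with SIGSEGV.
crash_demo() {
  sh -c 'kill -SEGV $$'
}

# Run a command and describe how it ended, based on its exit status.
describe_exit() {
  local status=0
  "$@" || status=$?
  if [ "$status" -gt 128 ]; then
    # By shell convention, death by signal is reported as 128 + signal number.
    echo "killed by signal $((status - 128)) (exit status $status)"
  else
    echo "exited normally with status $status"
  fi
}

describe_exit crash_demo
```

In the notebook, the same helper could wrap the real binary, for example `describe_exit ./mnist` from inside `examples/cpp/mnist`.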

The CPU case works fine:

Training on CPU.

snip

Train Epoch: 10 [59584/60000] Loss: 0.0773
Test set: Average loss: 0.0512 | Accuracy: 0.984

Any suggestions? (Could this be a Jupyter notebook problem?)
The same commands work fine in a regular terminal.

sakaia · April 24, 2019, 12:46pm

Running a debug build under gdb shows that it stops in libcudart.

Compiling

!cd examples/cpp/mnist;export CFLAGS="-g -O0 -G";cmake -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.6/dist-packages/torch/lib/ -DTorch_DIR=/usr/local/lib/python3.6/dist-packages/torch/share/cmake/Torch/ ; make
!apt install gdb
!cd examples/cpp/mnist;gdb ./mnist

Output of gdb

Thread 1 "mnist" received signal SIGSEGV, Segmentation fault.
0x00007f4325e8a9fe in ?? () from /usr/local/cuda/lib64/libcudart.so.10.0
(gdb) bt
#0  0x00007f4325e8a9fe in ?? () from /usr/local/cuda/lib64/libcudart.so.10.0
#1  0x00007f4325e8f96b in ?? () from /usr/local/cuda/lib64/libcudart.so.10.0
#2  0x00007f4325ea4be2 in cudaDeviceSynchronize ()
   from /usr/local/cuda/lib64/libcudart.so.10.0
#3  0x00007f42eb590394 in cudnnDestroy ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#4  0x00007f42e710ecf1 in std::unordered_map<int, at::native::(anonymous namespace)::Handle, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, at::native::(anonymous namespace)::Handle> > >::~unordered_map() ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#5  0x00007f42e4f58615 in __cxa_finalize (d=0x7f4318cd2780)
    at cxa_finalize.c:83
#6  0x00007f42e6f1cfb3 in __do_global_dtors_aux ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#7  0x00007ffd33020160 in ?? ()
#8  0x00007f4329c43b73 in _dl_fini () at dl-fini.c:138
Backtrace stopped: frame did not save the PC
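
The trace above shows the fault during process exit (`_dl_fini` down to `cudnnDestroy` and `cudaDeviceSynchronize`), so the program has to run to completion under gdb before the backtrace is available. If typing at the `(gdb)` prompt inside a notebook cell is awkward, gdb can also be driven non-interactively. The sketch below assumes gdb is installed; `gdb_backtrace` is a hypothetical helper name, not something from the PyTorch examples.

```shell
# Hypothetical helper: run a program under gdb without an interactive prompt
# and print a backtrace once the program stops.
gdb_backtrace() {
  # -batch   : exit gdb after the given commands have been processed
  # -ex CMD  : execute a gdb command (here: run the program, then print bt)
  # --args   : treat everything after it as the program and its arguments
  gdb -batch -ex run -ex bt --args "$@"
}

# Usage for this thread's setup (assumes ./mnist has already been built):
# (cd examples/cpp/mnist && gdb_backtrace ./mnist)
```

If the program exits normally instead of crashing, the `bt` step simply has no stack to report.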