I was trying to use weight_norm() from the master branch, so I built the bleeding-edge version of PyTorch from source.
The error message is as follows:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffd69bba8c in THCudaFree () from /home/user2/.conda/envs/pytorch_master/lib/python3.6/site-packages/torch/lib/libTHC.so.1
So could anyone tell me the best practice for building PyTorch from source?
What I would do personally is:
Well… I might start by doing the above. You’ve started doing that, but you’re missing:
- the ginormous gist of full commands and output
- full details on your OS and anything weird/unusual/interesting about your system
@Danlu_Chan do you have some small code that can reproduce this issue?
I used a conda virtual environment to build the master branch.
Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-83-generic x86_64)
conda create --name pytorch_master
source activate pytorch_master
git clone https://github.com/pytorch/pytorch
conda install numpy pyyaml mkl setuptools cmake gcc cffi
conda install -c soumith magma-cuda80
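After that, the remaining steps were roughly the following (a sketch; setting CMAKE_PREFIX_PATH follows the PyTorch README’s suggestion, and the env path is an assumption about my setup):

```shell
# Point CMake at the conda environment so the build picks up the conda-installed deps
# (adjust the path to wherever your env actually lives)
export CMAKE_PREFIX_PATH="$HOME/.conda/envs/pytorch_master"
cd pytorch
python setup.py install
```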
And if I run python setup.py install directly, it fails with an import error:
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'
Current thread 0x00007faeabc48700 (most recent call first):
 12644 abort (core dumped) python setup.py install
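As far as I understand, the “No module named 'encodings'” error usually means the interpreter can’t locate its own standard library, so a quick sanity check is to see which python the env resolves to (a sketch, nothing PyTorch-specific assumed):

```shell
# Check which interpreter and prefix the shell resolves to inside the env;
# a mismatch here would explain the "No module named 'encodings'" failure
which python
python -c "import sys; print(sys.executable); print(sys.prefix)"
```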
My solution to this is to deactivate the virtual environment and activate it again. After that I can install PyTorch without any problem.
But no matter what program I try to run, as long as it uses a CUDA tensor, it crashes with a segmentation fault like the one above.
I am wondering if it’s because of the conda virtual envs. If I want to install the bleeding-edge version of PyTorch in a virtual env, could you please shed some light on the best way to do this?
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.backends.cudnn as cudnn

input = torch.randn(64, 3, 32, 32).cuda()
input_var = Variable(input)
cudnn.benchmark = True
net = nn.Conv2d(3, 24, kernel_size=3, stride=1).cuda()  # conv layer on the GPU
output_var = net(input_var)
Finally found where the seg fault comes from! It’s because I set
cudnn.benchmark = True. Do you have any idea why?
FYI: I could run v0.12 with the flag
cudnn.benchmark = True on the same machine, so the installed cuDNN itself shouldn’t be the problem. Is it possible that something goes wrong when linking to the cuDNN lib?
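I’m not sure of the best way to verify this, but a sketch of what I’d check (assuming cudnn.version() exists in this build, which may not be the case):

```shell
# Which libcudnn copies can the dynamic loader see system-wide?
ldconfig -p | grep -i cudnn
# And which cuDNN version does the installed torch actually report?
python -c "import torch.backends.cudnn as cudnn; print(cudnn.version())"
```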
This is fixed in master. You’ll see it fixed in the next release.
I installed 0.2+5254846 (the master branch) just now, but it seems that I still cannot use
cudnn.benchmark = True. Am I misunderstanding something?
Thanks for your quick reply!
I ran your script on master, and it didn’t segfault for me. Can you give me a gdb stack trace if it’s still crashing for you on the master branch?
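A non-interactive way to capture that trace (assuming the repro script above is saved as repro.py, and gdb is installed):

```shell
# Run the script under gdb, print a backtrace at the crash, then exit
gdb -ex run -ex bt -ex quit --args python repro.py
```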