I was trying to use weight_norm() from the master branch, so I built the bleeding-edge version of PyTorch from source.
The error message is as follows:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffd69bba8c in THCudaFree () from /home/user2/.conda/envs/pytorch_master/lib/python3.6/site-packages/torch/lib/libTHC.so.1
So could anyone tell me the best practice for building PyTorch from source?
What I would do personally is:
Well… I might start by doing the above. You’ve started doing that, but you’re missing:
- the ginormous gist of full commands and output
- full details on your OS and anything weird/unusual/interesting about your system
@Danlu_Chan do you have some small code that can reproduce this issue?
I used a conda virtual environment to build the master branch.
Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-83-generic x86_64)
conda create --name pytorch_master
source activate pytorch_master
git clone https://github.com/pytorch/pytorch
conda install numpy pyyaml mkl setuptools cmake gcc cffi
conda install -c soumith magma-cuda80
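After that, the remaining steps were roughly the following (a sketch; setting CMAKE_PREFIX_PATH follows the PyTorch README’s suggestion, and the env path is an assumption about my setup):

```shell
# Point CMake at the conda environment so the build picks up the conda-installed deps
# (adjust the path to wherever your env actually lives)
export CMAKE_PREFIX_PATH="$HOME/.conda/envs/pytorch_master"
cd pytorch
python setup.py install
```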
And if I run python setup.py install directly, it fails with an import error:
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'
Current thread 0x00007faeabc48700 (most recent call first):
 12644 abort (core dumped) python setup.py install
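As far as I understand, the “No module named 'encodings'” error usually means the interpreter can’t locate its own standard library, so a quick sanity check is to see which python the env resolves to (a sketch, nothing PyTorch-specific assumed):

```shell
# Check which interpreter and prefix the shell resolves to inside the env;
# a mismatch here would explain the "No module named 'encodings'" failure
which python
python -c "import sys; print(sys.executable); print(sys.prefix)"
```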
My solution to this is to deactivate the virtual environment and activate it again. After that I can install PyTorch without any problem.
But no matter what program I try to run, as long as it uses a CUDA tensor, it crashes with a segmentation fault like the one above.
I am wondering if it’s because of the conda virtual envs. If I want to install the bleeding-edge version of PyTorch in a virtual env, could you please shed some light on the best way to do this?
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.backends.cudnn as cudnn

input = torch.randn(64, 3, 32, 32).cuda()
input_var = Variable(input)
cudnn.benchmark = True
net = nn.Conv2d(3, 24, kernel_size=3, stride=1).cuda()  # conv layer on the GPU
output_var = net(input_var)
Finally found where the seg fault comes from! It’s because I set
cudnn.benchmark = True. Do you have any idea why?
FYI: I could run v0.12 with the flag
cudnn.benchmark = True on the same machine, so the installed cuDNN itself shouldn’t be the problem. Is it possible that something goes wrong when linking to the cuDNN lib?
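I’m not sure of the best way to verify this, but a sketch of what I’d check (assuming cudnn.version() exists in this build, which may not be the case):

```shell
# Which libcudnn copies can the dynamic loader see system-wide?
ldconfig -p | grep -i cudnn
# And which cuDNN version does the installed torch actually report?
python -c "import torch.backends.cudnn as cudnn; print(cudnn.version())"
```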
This is fixed in master. You’ll see it fixed in the next release.
I installed 0.2+5254846 (the master branch) just now, but it seems that I still cannot use
cudnn.benchmark = True. Am I misunderstanding something?
Thanks for your quick reply!
I ran your script on master, and it didn’t segfault for me. Can you give me a gdb stack trace if it’s still crashing for you on the master branch?
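A non-interactive way to capture that trace (assuming the repro script above is saved as repro.py, and gdb is installed):

```shell
# Run the script under gdb, print a backtrace at the crash, then exit
gdb -ex run -ex bt -ex quit --args python repro.py
```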