ac2211
(Toni Creswell)
December 18, 2017, 3:05pm
1
I save and load all my models using the following methods inside a model class (with torch and os.path.join imported at module level):

    def save_params(self, exDir):
        print 'saving params...'
        torch.save(self.state_dict(), join(exDir, 'dae_params'))

    def load_params(self, exDir):
        print 'loading params...'
        self.load_state_dict(torch.load(join(exDir, 'dae_params')))
Normally the models save and load without error. However, I am currently getting the following error when trying to load a model:
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.c line=82 error=46 : all CUDA-capable devices are busy or unavailable
Segmentation fault
The model is 35M and the GPU has 11439M of memory. Any suggestions as to why this may be happening? I have checked with nvidia-smi and there are GPUs available, and I am still able to load other models of the same size on the same GPU.
Thanks in advance,
Toni.
smth
December 18, 2017, 3:59pm
2
What version of PyTorch are you on (print(torch.__version__)), and what is the output of nvidia-smi?
Can you also run the following and report back the log:
$ CUDA_LAUNCH_BLOCKING=1 gdb python
(gdb) r your_script.py
# when you get segfault
(gdb) bt
ac2211
(Toni Creswell)
December 18, 2017, 4:03pm
3
smth:
print(torch.__version__)
Version: '0.3.0.post4'
nvidia-smi:
Mon Dec 18 16:01:43 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   61C    P0    81W / 149W |    742MiB / 11439MiB |      1%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |
| N/A   53C    P0   133W / 149W |   1595MiB / 11439MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:0D:00.0 Off |                    0 |
| N/A   35C    P8    28W / 149W |     11MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:0E:00.0 Off |                    0 |
| N/A   40C    P0    91W / 149W |    739MiB / 11439MiB |      1%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:85:00.0 Off |                    0 |
| N/A   64C    P0   127W / 149W |   1607MiB / 11439MiB |     55%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    71W / 149W |    255MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:8D:00.0 Off |                    0 |
| N/A   28C    P8    27W / 149W |     11MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:8E:00.0 Off |                    0 |
| N/A   23C    P8    30W / 149W |     11MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2512927      C   python                                       731MiB |
|    1   2573733      C   python                                      1584MiB |
|    3   2561442      C   python                                       728MiB |
|    4   2573622      C   python                                      1596MiB |
|    5   2573733      C   python                                       244MiB |
+-----------------------------------------------------------------------------+
GPUs 2, 6 and 7 are free.
I set and check the GPU that I am using with:
torch.cuda.set_device(opts.gpuNo)
print 'using gpu:', torch.cuda.current_device()
dae.cuda()
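As an aside, a common alternative to torch.cuda.set_device (not from this thread, just a sketch of a standard workaround) is to mask the other cards with CUDA_VISIBLE_DEVICES before any CUDA library initialises; the device index 6 below is only an example:

    import os

    # Must be set before torch (or any other CUDA library) touches the driver.
    # Physical GPU 6 then appears inside this process as device 0.
    os.environ['CUDA_VISIBLE_DEVICES'] = '6'

After this, torch.cuda.current_device() reports 0 and no set_device call is needed, which also rules out accidentally allocating on a busy card.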
ac2211
(Toni Creswell)
December 18, 2017, 4:41pm
4
output:
(gdb) bt
#0 0x00007fff60134979 in THCudaStorage_free () from /vol/biomedic2/ac2211/venvPytorch/local/lib/python2.7/site-packages/torch/lib/libTHC.so.1
#1 0x00007fff86146ad7 in THCPFloatStorage_dealloc (self=0x7fff47a40878) at /pytorch/torch/csrc/generic/Storage.cpp:21
#2 0x00000000004fd53a in ?? ()
#3 0x00007fff8615125d in THPPointer<THCPFloatStorage>::~THPPointer (this=0x7fffffffd090, __in_chrg=<optimised out>) at /pytorch/torch/csrc/utils/object_ptr.h:12
#4 THCPFloatStorage_pynew (type=<optimised out>, args=<optimised out>, kwargs=<optimised out>) at /pytorch/torch/csrc/generic/Storage.cpp:172
#5 0x00000000004e741c in ?? ()
#6 0x00000000004b0c93 in PyObject_Call ()
#7 0x00000000004e6799 in ?? ()
#8 0x00000000004b6623 in ?? ()
#9 0x00000000004b0c93 in PyObject_Call ()
#10 0x00000000004c9f9f in PyEval_EvalFrameEx ()
#11 0x00000000004c2705 in PyEval_EvalCodeEx ()
#12 0x00000000004ca088 in PyEval_EvalFrameEx ()
#13 0x00000000004c9d7f in PyEval_EvalFrameEx ()
#14 0x00000000004c9d7f in PyEval_EvalFrameEx ()
#15 0x00000000004c2705 in PyEval_EvalCodeEx ()
#16 0x00000000004de69e in ?? ()
#17 0x00000000004b0c93 in PyObject_Call ()
#18 0x000000000045b099 in ?? ()
#19 0x000000000045a616 in ?? ()
#20 0x00000000004c96e0 in PyEval_EvalFrameEx ()
#21 0x00000000004c2705 in PyEval_EvalCodeEx ()
#22 0x00000000004ca7df in PyEval_EvalFrameEx ()
#23 0x00000000004c2705 in PyEval_EvalCodeEx ()
#24 0x00000000004ca088 in PyEval_EvalFrameEx ()
#25 0x00000000004c9d7f in PyEval_EvalFrameEx ()
#26 0x00000000004c2705 in PyEval_EvalCodeEx ()
#27 0x00000000004c24a9 in PyEval_EvalCode ()
#28 0x00000000004f19ef in ?? ()
#29 0x00000000004ec372 in PyRun_FileExFlags ()
#30 0x00000000004eaaf1 in PyRun_SimpleFileExFlags ()
#31 0x000000000049e208 in Py_Main ()
#32 0x00007ffff7810830 in __libc_start_main (main=0x49db30 <main>, argc=10, argv=0x7fffffffe478, init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>,
stack_end=0x7fffffffe468) at ../csu/libc-start.c:291
#33 0x000000000049da59 in _start ()
(gdb)
smth
December 18, 2017, 5:09pm
5
Hmm, this is strange.
Is there any way I can publicly reproduce this?
Also, separate question: does the segfault also occur if you install pytorch in anaconda (instead of a python virtualenv)?
ac2211
(Toni Creswell)
December 18, 2017, 5:22pm
6
I would need to send you the model parameters, and I am happy to paste the model and loading code here. Is there an easy way to share the params?
Also, separate question: does the segfault also occur if you install pytorch in anaconda (instead of a python virtualenv)?
I have not tried anaconda – yet.
smth
December 18, 2017, 5:23pm
7
You can use google drive for the parameters file.
ac2211
(Toni Creswell)
December 18, 2017, 6:33pm
8
Parameters are here: https://drive.google.com/open?id=1yQoUQzoXFulQ_mV-2hthF0AC1RFfJpel
I have shared the model via GitHub: https://github.com/ToniCreswell/pyTorch_DAAE
Alternatively, this script reproduces the error (when called inside the scripts folder):

    import sys
    sys.path.append('../')
    import torch
    from models import DAE

    PATH = 'path/to/param/folder'  # folder in which the params are saved, not the params themselves
    GPUNO = 6

    dae = DAE(nz=200, imSize=64, fSize=64, sigma=0.25, multimodalZ=False)
    if dae.useCUDA:
        torch.cuda.set_device(GPUNO)
        print 'using gpu:', torch.cuda.current_device()
        dae.cuda()
    dae.eval()
    dae.load_params(PATH)
smth
December 18, 2017, 8:37pm
9
I tried this under both anaconda python and a standard system python virtualenv.
For standard python, I used:
virtualenv venv
source venv/bin/activate
pip install http://download.pytorch.org/whl/cu90/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl
pip install torchvision
python foo.py
The output I see is:
('using gpu:', 1L)
loading params...
I had to set GPUNO to 1 because I don't have 7 GPUs on the machine I am testing on.
Does using a GPUNO of 0 or 1 fix it?
smth
December 18, 2017, 8:42pm
10
also, what OS are you running? I am running
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
ac2211
(Toni Creswell)
December 19, 2017, 10:08am
11
smth:
lsb_release -a
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
ac2211
(Toni Creswell)
January 15, 2018, 3:20pm
12
I had this same problem again when loading a model that I had previously been able to load; the following fixed it:

    self.load_state_dict(torch.load(join(exDir, 'params'),
                                    map_location=lambda storage, loc: storage.cuda(0)))
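A more device-portable variant of the same fix (a sketch, not from the thread; the checkpoint written below is only a stand-in for the real 'params' file) is to map every storage to the CPU at load time and only move the model to a GPU afterwards, so loading never depends on the device the file was saved from:

    import os
    import tempfile

    import torch

    # Write a small state-dict-like checkpoint to stand in for 'params'.
    with tempfile.TemporaryDirectory() as ex_dir:
        path = os.path.join(ex_dir, 'params')
        torch.save({'w': torch.ones(3)}, path)

        # Map every storage to CPU regardless of which device wrote the file;
        # the model can then be moved with .cuda(device) once loading succeeds.
        state = torch.load(path, map_location=lambda storage, loc: storage)

    print(state['w'].sum().item())  # 3.0

This avoids the "all CUDA-capable devices are busy or unavailable" failure mode entirely, because deserialisation never touches the GPU.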
Same problem here; I think it is on Ubuntu 16.04.3 LTS.