ac2211
(Toni Creswell)
December 18, 2017, 3:05pm
1
I save and load all my models using the following methods inside a model class (with torch and os.path.join imported at module level):

    def save_params(self, exDir):
        print 'saving params...'
        torch.save(self.state_dict(), join(exDir, 'dae_params'))

    def load_params(self, exDir):
        print 'loading params...'
        self.load_state_dict(torch.load(join(exDir, 'dae_params')))
Normally the models save and load without error. However, I am currently getting the following error when trying to load a model:
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.c line=82 error=46 : all CUDA-capable devices are busy or unavailable
Segmentation fault
The model is 35M and the GPU has 11439M of memory. Any suggestions as to why this may be happening? I have checked with nvidia-smi and there are GPUs available, and I am still able to load other models of the same size on the same GPU.
Thanks in advance,
Toni.
smth
December 18, 2017, 3:59pm
2
What version of PyTorch are you on (print(torch.__version__)), and what is the output of nvidia-smi?
Can you also run the following and report back the log:
$ CUDA_LAUNCH_BLOCKING=1 gdb python
(gdb) r your_script.py
# when you get segfault
(gdb) bt
ac2211
(Toni Creswell)
December 18, 2017, 4:03pm
3
smth:
print(torch.__version__)
Version: '0.3.0.post4'
nvidia-smi:
Mon Dec 18 16:01:43 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   61C    P0    81W / 149W |    742MiB / 11439MiB |      1%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |
| N/A   53C    P0   133W / 149W |   1595MiB / 11439MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:0D:00.0 Off |                    0 |
| N/A   35C    P8    28W / 149W |     11MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:0E:00.0 Off |                    0 |
| N/A   40C    P0    91W / 149W |    739MiB / 11439MiB |      1%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:85:00.0 Off |                    0 |
| N/A   64C    P0   127W / 149W |   1607MiB / 11439MiB |     55%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    71W / 149W |    255MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:8D:00.0 Off |                    0 |
| N/A   28C    P8    27W / 149W |     11MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:8E:00.0 Off |                    0 |
| N/A   23C    P8    30W / 149W |     11MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2512927      C   python                                       731MiB |
|    1   2573733      C   python                                      1584MiB |
|    3   2561442      C   python                                       728MiB |
|    4   2573622      C   python                                      1596MiB |
|    5   2573733      C   python                                       244MiB |
+-----------------------------------------------------------------------------+
GPUs 2, 6 and 7 are free.
I set and check the GPU that I am using with:
torch.cuda.set_device(opts.gpuNo)
print 'using gpu:', torch.cuda.current_device()
dae.cuda()
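As an aside, a common alternative to torch.cuda.set_device (not from this thread, just a sketch of a standard workaround) is to mask the other cards with CUDA_VISIBLE_DEVICES before any CUDA library initialises; the device index 6 below is only an example:

    import os

    # Must be set before torch (or any other CUDA library) touches the driver.
    # Physical GPU 6 then appears inside this process as device 0.
    os.environ['CUDA_VISIBLE_DEVICES'] = '6'

After this, torch.cuda.current_device() reports 0 and no set_device call is needed, which also rules out accidentally allocating on a busy card.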
ac2211
(Toni Creswell)
December 18, 2017, 4:41pm
4
output:
(gdb) bt
#0 0x00007fff60134979 in THCudaStorage_free () from /vol/biomedic2/ac2211/venvPytorch/local/lib/python2.7/site-packages/torch/lib/libTHC.so.1
#1 0x00007fff86146ad7 in THCPFloatStorage_dealloc (self=0x7fff47a40878) at /pytorch/torch/csrc/generic/Storage.cpp:21
#2 0x00000000004fd53a in ?? ()
#3 0x00007fff8615125d in THPPointer<THCPFloatStorage>::~THPPointer (this=0x7fffffffd090, __in_chrg=<optimised out>) at /pytorch/torch/csrc/utils/object_ptr.h:12
#4 THCPFloatStorage_pynew (type=<optimised out>, args=<optimised out>, kwargs=<optimised out>) at /pytorch/torch/csrc/generic/Storage.cpp:172
#5 0x00000000004e741c in ?? ()
#6 0x00000000004b0c93 in PyObject_Call ()
#7 0x00000000004e6799 in ?? ()
#8 0x00000000004b6623 in ?? ()
#9 0x00000000004b0c93 in PyObject_Call ()
#10 0x00000000004c9f9f in PyEval_EvalFrameEx ()
#11 0x00000000004c2705 in PyEval_EvalCodeEx ()
#12 0x00000000004ca088 in PyEval_EvalFrameEx ()
#13 0x00000000004c9d7f in PyEval_EvalFrameEx ()
#14 0x00000000004c9d7f in PyEval_EvalFrameEx ()
#15 0x00000000004c2705 in PyEval_EvalCodeEx ()
#16 0x00000000004de69e in ?? ()
#17 0x00000000004b0c93 in PyObject_Call ()
#18 0x000000000045b099 in ?? ()
#19 0x000000000045a616 in ?? ()
#20 0x00000000004c96e0 in PyEval_EvalFrameEx ()
#21 0x00000000004c2705 in PyEval_EvalCodeEx ()
#22 0x00000000004ca7df in PyEval_EvalFrameEx ()
#23 0x00000000004c2705 in PyEval_EvalCodeEx ()
#24 0x00000000004ca088 in PyEval_EvalFrameEx ()
#25 0x00000000004c9d7f in PyEval_EvalFrameEx ()
#26 0x00000000004c2705 in PyEval_EvalCodeEx ()
#27 0x00000000004c24a9 in PyEval_EvalCode ()
#28 0x00000000004f19ef in ?? ()
#29 0x00000000004ec372 in PyRun_FileExFlags ()
#30 0x00000000004eaaf1 in PyRun_SimpleFileExFlags ()
#31 0x000000000049e208 in Py_Main ()
#32 0x00007ffff7810830 in __libc_start_main (main=0x49db30 <main>, argc=10, argv=0x7fffffffe478, init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>,
stack_end=0x7fffffffe468) at ../csu/libc-start.c:291
#33 0x000000000049da59 in _start ()
(gdb)
smth
December 18, 2017, 5:09pm
5
Hmm, this is strange.
Is there any way I can publicly reproduce this?
Also, separate question: does the segfault also occur if you install pytorch in anaconda (instead of a python virtualenv)?
ac2211
(Toni Creswell)
December 18, 2017, 5:22pm
6
I would need to send you the model parameters, and I am happy to paste the model and loading code here. Is there an easy way to share the params?
Also, separate question: does the segfault also occur if you install pytorch in anaconda (instead of a python virtualenv)?
I have not tried anaconda – yet.
smth
December 18, 2017, 5:23pm
7
You can use google drive for the parameters file.
ac2211
(Toni Creswell)
December 18, 2017, 6:33pm
8
Parameters are here: https://drive.google.com/open?id=1yQoUQzoXFulQ_mV-2hthF0AC1RFfJpel
I have shared the model via GitHub: https://github.com/ToniCreswell/pyTorch_DAAE
Alternatively, this script reproduces the error (when called inside the scripts folder):

    import sys
    sys.path.append('../')
    import torch
    from models import DAE

    PATH = 'path/to/param/folder'  # folder in which the params are saved, not the params themselves
    GPUNO = 6

    dae = DAE(nz=200, imSize=64, fSize=64, sigma=0.25, multimodalZ=False)
    if dae.useCUDA:
        torch.cuda.set_device(GPUNO)
        print 'using gpu:', torch.cuda.current_device()
        dae.cuda()
    dae.eval()
    dae.load_params(PATH)
smth
December 18, 2017, 8:37pm
9
I tried this under both anaconda python and a standard system python virtualenv.
For standard python, I used:
virtualenv venv
source venv/bin/activate
pip install http://download.pytorch.org/whl/cu90/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl
pip install torchvision
python foo.py
The output I see is:
('using gpu:', 1L)
loading params...
I had to set GPUNO to 1 because I don't have 7 GPUs on the machine I am testing on.
Does using a GPUNO of 0 or 1 fix it?
smth
December 18, 2017, 8:42pm
10
also, what OS are you running? I am running
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
ac2211
(Toni Creswell)
December 19, 2017, 10:08am
11
smth:
lsb_release -a
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
ac2211
(Toni Creswell)
January 15, 2018, 3:20pm
12
I had this same problem again when loading a model that I had previously been able to load; the following fixed it:

    self.load_state_dict(torch.load(join(exDir, 'params'),
                                    map_location=lambda storage, loc: storage.cuda(0)))
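A more device-portable variant of the same fix (a sketch, not from the thread; the checkpoint written below is only a stand-in for the real 'params' file) is to map every storage to the CPU at load time and only move the model to a GPU afterwards, so loading never depends on the device the file was saved from:

    import os
    import tempfile

    import torch

    # Write a small state-dict-like checkpoint to stand in for 'params'.
    with tempfile.TemporaryDirectory() as ex_dir:
        path = os.path.join(ex_dir, 'params')
        torch.save({'w': torch.ones(3)}, path)

        # Map every storage to CPU regardless of which device wrote the file;
        # the model can then be moved with .cuda(device) once loading succeeds.
        state = torch.load(path, map_location=lambda storage, loc: storage)

    print(state['w'].sum().item())  # 3.0

This avoids the "all CUDA-capable devices are busy or unavailable" failure mode entirely, because deserialisation never touches the GPU.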
Same problem here; I think it is on Ubuntu 16.04.3 LTS.