Just getting started with PyTorch (very nice system, btw). Unfortunately, for the last couple of days I've been trying to run unmodified tutorial code in PyCharm (mostly transformer_tutorial.py), and sometimes I get the following error:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=719 : unspecified launch failure
At this point, if I open a separate ipython console and try to check my GPU status, I get this:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCGeneral.cpp line=50 error=999 : unknown error
RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>
      1 import torch
----> 2 torch.cuda.current_device()

~/Software/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py in current_device()
    375 def current_device():
    376     r"""Returns the index of a currently selected device."""
--> 377     _lazy_init()
    378     return torch._C._cuda_getDevice()
    379

~/Software/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py in _lazy_init()
    195                 "Cannot re-initialize CUDA in forked subprocess. " + msg)
    196     _check_driver()
--> 197     torch._C._cuda_init()
    198     _cudart = _load_cudart()
    199     _cudart.cudaGetErrorName.restype = ctypes.c_char_p
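For reference, the check I run in the separate ipython console is essentially the following. I've wrapped it here in a small helper so a dead CUDA context returns a message instead of killing the session (`describe_cuda` is just my own name, not a PyTorch API):

```python
def describe_cuda():
    """Return a short GPU status string instead of raising."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"
    try:
        # This is the call that raises error 999 for me when things break.
        idx = torch.cuda.current_device()
        name = torch.cuda.get_device_name(idx)
        return f"device {idx}: {name} (CUDA {torch.version.cuda})"
    except RuntimeError as e:
        return f"CUDA runtime error: {e}"

print(describe_cuda())
```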
At other times, the check succeeds and code using the GPU runs fine. It has broken twice since yesterday evening, and the problem doesn't go away until I restart my computer, which is a pain given how much I have open (including VMs).
My configuration: Ubuntu 18.04 (up to date), nvidia-driver-440 with all its dependencies, and conda shows:
pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
cudatoolkit 10.1.243 h6bb024c_0
I do not have CUDA or cuDNN installed system-wide, but I gather they are unnecessary when conda's cudatoolkit is installed (I hope)?
nvidia-smi output is as follows. I have the system using the CPU's integrated graphics to keep my GPU free, but the output seems to show the GPU running the PyTorch code (it's stopped in my debugger). One disconnect: if I understand correctly, my PyTorch build above wants CUDA 10.1, while nvidia-smi seems to think it's using (or wants to use?) CUDA 10.2 (which would be the default for driver 440, I guess). However, as already noted, I don't have CUDA installed on my system other than through cudatoolkit in Anaconda, which I gather provides 10.1.
$ nvidia-smi
Thu Feb 13 15:26:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:01:00.0 Off |                  N/A |
| 12%   55C    P2    41W / 225W |    761MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16432      C   ...armProjects/PytorchTest/venv/bin/python   747MiB |
+-----------------------------------------------------------------------------+
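To pin down the 10.1-vs-10.2 question above, this is how I compare what PyTorch itself was built against with what nvidia-smi reports (`torch_build_versions` is my own helper name; it returns None if torch can't be imported):

```python
def torch_build_versions():
    """Report the CUDA/cuDNN versions PyTorch was compiled against,
    independent of what the driver (nvidia-smi) advertises."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "cuda": torch.version.cuda,            # e.g. "10.1" for my conda build
        "cudnn": torch.backends.cudnn.version(),
    }

print(torch_build_versions())
```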
Any thoughts?