CUDA runtime error (999)

Just getting started with PyTorch (very nice system, btw). Unfortunately, for the last couple of days I’ve been trying to run unmodified tutorial code in PyCharm (mostly transformer_tutorial.py), and sometimes I get the following error in PyCharm:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=719 : unspecified launch failure

At this point, if I open a separate ipython console and try to check my GPU status, I get this:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCGeneral.cpp line=50 error=999 : unknown error

RuntimeError Traceback (most recent call last)
in
      1 import torch
----> 2 torch.cuda.current_device()

~/Software/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py in current_device()
    375 def current_device():
    376     r"""Returns the index of a currently selected device."""
--> 377     _lazy_init()
    378     return torch._C._cuda_getDevice()
    379

~/Software/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py in _lazy_init()
    195             "Cannot re-initialize CUDA in forked subprocess. " + msg)
    196     _check_driver()
--> 197     torch._C._cuda_init()
    198     _cudart = _load_cudart()
    199     _cudart.cudaGetErrorName.restype = ctypes.c_char_p

At other times, I have no problem checking my GPU and code accessing the GPU runs without problems. Things have broken twice now since yesterday evening and the problem doesn’t go away until I restart my computer, which is a pain given how much I have open (including VMs).
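The GPU status check described above can also be run in a throwaway child process, so that a failed CUDA initialization does not leave a long-lived interpreter in a bad state. This is only a minimal sketch: `cuda_probe_ok` and the exit-code convention are hypothetical names chosen for illustration, and the only PyTorch call it relies on is `torch.cuda.is_available()`. It assumes PyTorch is installed for the interpreter that runs it. If the check raises instead of returning False, the child exits with a nonzero status, which is treated the same way.

```python
import subprocess
import sys

# Default probe: the child exits with 0 only if PyTorch reports a usable CUDA device.
# torch.cuda.is_available() is a real PyTorch call; the rest is illustrative.
DEFAULT_PROBE = [
    sys.executable,
    "-c",
    "import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)",
]


def cuda_probe_ok(argv=None, timeout=60):
    """Run the probe command in a fresh process; return True if it exits with 0."""
    argv = DEFAULT_PROBE if argv is None else argv
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung probe or a missing interpreter counts as "not available".
        return False
    return result.returncode == 0
```

Calling `cuda_probe_ok()` before and after an event such as a suspend/resume cycle gives a simple yes/no answer without importing torch into the current session.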

My configuration is Ubuntu 18.04, up to date; am using nvidia-driver-440 and all dependencies; and conda shows:
pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
cudatoolkit 10.1.243 h6bb024c_0

I do not have cuda or cudnn installed on my computer, but gather they are unnecessary when cudatoolkit is installed (I hope)?

nvidia-smi is as follows. I have the system using the CPU’s integrated graphics to free up my GPU, but the output below seems to show that the GPU is running the PyTorch code (it’s stopped in my debugger). One disconnect is, if I understand correctly, that my version of PyTorch above wants CUDA 10.1, while nvidia-smi seems to think it’s using (or wants to use?) CUDA 10.2 (which would be the default for driver 440, I guess). However, as already noted, I don’t have CUDA installed on my system other than through cudatoolkit in Anaconda, which I gather has CUDA 10.1.

$ nvidia-smi
Thu Feb 13 15:26:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:01:00.0 Off |                  N/A |
| 12%   55C    P2    41W / 225W |    761MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16432      C   …armProjects/PytorchTest/venv/bin/python     747MiB |
+-----------------------------------------------------------------------------+

Any thoughts?

Oh, and I tried os.environ["CUDA_VISIBLE_DEVICES"] = '0' with no positive effect. And torch.version.cuda gives me '10.1'.

If I’m right and I don’t need to install CUDA and cuDNN on my system so long as I have cudatoolkit installed in Anaconda (could someone confirm this?), then the most likely source of my problem may be the NVIDIA driver, which is designed for CUDA 10.2 while PyTorch uses 10.1. I fell back to nvidia-driver-435. nvidia-smi now shows the CUDA version as 10.1, and so far I haven’t run into any errors in PyTorch. Fingers crossed.

Yes, you are correct in the assumption that you don’t need a local CUDA and cudnn installation, if you are installing the binaries.
The NVIDIA driver should be sufficient.

Are you getting some other CUDA error while running the code or is this error raised randomly?
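On the 10.1 versus 10.2 question raised above: the "CUDA Version" field in the nvidia-smi header reports the newest CUDA runtime the installed driver supports, not a toolkit installed on the system, and newer drivers generally run binaries built against older runtimes. Under that reading, a driver reporting 10.2 with a PyTorch build using 10.1 is not a mismatch by itself. The sketch below illustrates the comparison; the function names are hypothetical helpers written for this example, and the header string is copied from the nvidia-smi output earlier in the thread.

```python
import re


def parse_smi_header(line):
    """Extract (driver_version, cuda_version) strings from the nvidia-smi header line."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line
    )
    if match is None:
        raise ValueError("not an nvidia-smi header line")
    return match.group(1), match.group(2)


def version_tuple(text):
    """Turn '10.1.243' into (10, 1, 243) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_covers_runtime(driver_cuda, runtime_cuda):
    """True if the driver's maximum CUDA version is at least the runtime's version."""
    return version_tuple(driver_cuda) >= version_tuple(runtime_cuda)


# Header line as posted in the thread (whitespace collapsed).
header = "| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |"
driver, max_cuda = parse_smi_header(header)

# "10.1" is the value the original poster reported for torch.version.cuda.
print(driver, max_cuda, driver_covers_runtime(max_cuda, "10.1"))
```

The reverse case, a runtime newer than the driver's maximum, is the one that would be expected to fail.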

Thanks for the confirmation ptrblck! Looking this up online brings up a lot of older pages that seem to suggest it’s necessary to put CUDA and cuDNN on your system. But looking closely at others suggests otherwise. The pytorch install page doesn’t mention a separate install, but could be clearer in saying explicitly that these are not required.

I’m running unaltered tutorial code, so I wouldn’t expect runtime errors (there’s an error in one of the function implementations, but it won’t throw a runtime error). So, no, I get no other runtime errors and, as far as I can tell, these cudacheck errors are occurring at random.

Cheers, Peter

Thanks for the update.
Based on the description, it sounds like your current setup might have some issues.
Were you seeing these errors before or did you just build your machine?
Also, do you see any other applications raising CUDA errors?
Could you run a stress test on the GPU?

Is there a GPU stress test you recommend? I am a bit worried that this computer, which was thrown together, may not have the capacity to handle a lot of heat. But I guess I can monitor that for a while. Also note that I have my GPU turned off from graphics duties (it’s not driving my X windows), but it is available for calculation tasks.

So far, the system with the 435 driver seems much more stable, though I still run into problems, just less frequently. After a couple of days of use, I had to reboot to get CUDA back again (same problem as in my initial post here). Also, after starting and stopping the debugger (with a suspend thrown in), the debugger says no GPU is free, even when I open an ipython console in the debugger. However, when I open ipython from a terminal, it sees the GPU. A PyCharm problem, I guess.

Peter

What do the temperature sensors report?
Are you seeing high temperatures on your GPU or the system in general?

Hi ptrblck: If you’re thinking that heat may be why I’m having these problems, it’s not. I’m very slowly stepping through PyTorch’s transformer implementation using PyCharm, not running code, at least on my work computer which has had the problem I’ve described. (I worry about heat on the system because I ran a HLTA analysis that took 3 days and the sensors reported momentary temperatures above 90C, where 100C is a critical temperature for my chips according to lmsensors. The run was not on the GPU and encountered no errors.)

There’s also another reason to think the problem I’ve reported is not due to hardware. My home computer, which is entirely different hardware (different manufacturer, chipset, GPU), is encountering the same problem I reported here. So, two different hardware setups, the same software setup, the same error–which suggests a software issue. I’m going to do some experimenting with suspend because it seems that I run into this problem after waking from suspend.

Thanks for the update.
A software issue related to the suspend mode on two different machines sounds quite unlucky, but might of course be the issue.
Just out of curiosity, which OS are you using?

I’m using up-to-date Ubuntu 18.04.

Yesterday I didn’t put my work computer to suspend and didn’t run into any CUDA problem all day (should have suspended at end of day to see its effect but got caught up in something else). Rebooted my home computer and also didn’t run into any CUDA problem. Suspended it overnight, and this morning CUDA was no longer accessible via Python. Maybe worth noting that the GPU was still being used by Xorg and programs running on Xorg (my home computer uses the GPU for video, my work computer does not).

Will continue to test today. If this is a suspend issue, anywhere in particular I should report it to?

Peter

You could try to reload the nvidia kernel module via:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

Ubuntu seems to have some issues with sleep/suspend (or maybe Linux in general?).
While I never suspend my workstation, my laptop isn’t able to connect via VPN after waking up.

Not sure where to report it.

I can confirm that the problem seems to occur only, and fairly reliably, when I suspend and resume with PyCharm active and perhaps my debugger on. I just tried twice to destabilize CUDA without PyCharm running by suspending and resuming, using only ipython to check for availability of CUDA. Encountered no problem. Then I tried to destabilize CUDA by having PyCharm open but not debugging. Also encountered no problem. Later today, I’ll try again, but with the debugger running. I suspect that’s when it’ll fail, which should give me a good way to avoid breakdowns.

Also, tried rmmod nvidia_uvm (good suggestion–makes sense). Unfortunately, it gives an error msg saying nvidia_uvm is in use. I tried a variety of things, including removing other nvidia kernel modules, but whatever I do nvidia_uvm ‘is in use.’ I have to have ‘sudo prime-select nvidia’ in place or otherwise cuda is inaccessible, but the moment I use ‘sudo prime-select nvidia’ all the nvidia modules load. I can go back to ‘sudo prime-select intel’, but it takes a reboot to have any effect.
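The "is in use" message from rmmod corresponds to a nonzero reference count for the module, which can be read from /proc/modules before attempting a reload. The following is a rough sketch of that check. The helper names are hypothetical, and the sample text is made-up data in the /proc/modules layout (name, size, reference count, dependent modules, state, address), not output from the machine in this thread.

```python
def parse_proc_modules(text):
    """Map module name -> (reference count, list of modules that depend on it)."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        name, refcount, users = fields[0], fields[2], fields[3]
        # The dependents column is "-" when empty, else a comma-terminated list.
        dependents = [] if users == "-" else [u for u in users.split(",") if u]
        # A non-numeric count means the kernel does not report it; store -1.
        count = int(refcount) if refcount.isdigit() else -1
        modules[name] = (count, dependents)
    return modules


def can_unload(modules, name):
    """A module is removable only if it is loaded and its reference count is 0."""
    return name in modules and modules[name][0] == 0


# Hypothetical sample in /proc/modules format, for illustration only.
SAMPLE = """\
nvidia_uvm 798720 2 - Live 0x0000000000000000 (OE)
nvidia_drm 45056 0 - Live 0x0000000000000000 (OE)
nvidia_modeset 1093632 1 nvidia_drm, Live 0x0000000000000000 (OE)
nvidia 20406272 2 nvidia_uvm,nvidia_modeset, Live 0x0000000000000000 (OE)
"""

mods = parse_proc_modules(SAMPLE)
for module_name in ("nvidia_uvm", "nvidia_drm"):
    print(module_name, mods[module_name], can_unload(mods, module_name))
```

On a live system the text would come from reading /proc/modules. A nonzero count with no dependent modules listed suggests the holders are user-space processes that still have the device open, possibly a paused debugger session like the Python process shown in the nvidia-smi output, so those would need to exit before rmmod can succeed.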

Yup, CUDA remains much more stably accessible when the PyCharm debugger is terminated before suspending a machine. I can use it all day with multiple suspends. Overnight CUDA did crash and I thought I had terminated the debugger but perhaps not.

Anyway, a workaround to the problem, most of the time, seems to be to terminate the PyCharm debugger before suspending, otherwise CUDA will almost certainly become inaccessible (at least on my Ubuntu 18.04 system).

For a full solution, something needs to be fixed in PyTorch (maybe Python?) or in CUDA. I just ran the transformer tutorial code in Python directly, without PyCharm. During the run, I momentarily suspended the Linux system and then woke it. Immediately I got errors about CUDA, and now CUDA is inaccessible in python, ipython, etc.

My best bet would still be on the weird interactions I see between Linux (Ubuntu?) suspend and a lot of drivers. Googling for issues with suspend yields a lot of this “undefined behavior”.