If I’m right that I don’t need to install CUDA and cuDNN on my system so long as I have cudatoolkit installed in Anaconda (could someone confirm this?), then the most likely source of my problem may be the NVIDIA driver, which is designed for CUDA 10.2 while PyTorch uses 10.1. I fell back to nvidia-driver-435. nvidia-smi now shows the CUDA version as 10.1, and so far I haven’t run into any errors in PyTorch. Fingers crossed.
Yes, you are correct in the assumption that you don’t need a local CUDA and cuDNN installation if you are installing the binaries.
The NVIDIA driver should be sufficient.
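For anyone who wants to double-check this on their own machine, the versions bundled with the binaries can be inspected from Python itself; a minimal sketch (no system-wide CUDA toolkit is consulted here):

```python
# Check the CUDA / cuDNN versions that ship inside the PyTorch binaries.
# These come from the pip/conda package, not from any system-wide install.
import torch

print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # bundled cuDNN version (None on CPU-only builds)
print(torch.cuda.is_available())       # True only if the NVIDIA driver can see a GPU
```

If `torch.cuda.is_available()` is False while `torch.version.cuda` is set, the binaries are fine and the driver is the thing to look at.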
Are you getting some other CUDA error while running the code or is this error raised randomly?
Thanks for the confirmation, ptrblck! Looking this up online brings up a lot of older pages that seem to suggest it’s necessary to put CUDA and cuDNN on your system, but looking closely at others suggests otherwise. The PyTorch install page doesn’t mention a separate install, but it could be clearer in saying explicitly that these are not required.
I’m running unaltered tutorial code, so I wouldn’t expect runtime errors (there’s an error in one of the function implementations, but it won’t throw a runtime error). So, no, I get no other runtime errors and, as far as I can tell, these cudacheck errors are occurring at random.
Thanks for the update.
Based on the description it sounds like your current setup might have some issues.
Were you seeing these errors before or did you just build your machine?
Also, do you see any other applications raising CUDA errors?
Could you run a stress test on the GPU?
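Dedicated tools such as gpu-burn exist for this, but a quick-and-dirty stress test can also be sketched directly in PyTorch by hammering the card with large matrix multiplications while watching the temperature sensors in another terminal. The size and duration below are arbitrary assumptions; tune them for your card:

```python
# Crude GPU stress test sketch: repeated large matmuls for a fixed duration.
# Watch nvidia-smi / lm-sensors in another terminal while this runs.
import time
import torch

def stress(seconds=60, size=4096):
    if not torch.cuda.is_available():
        return "no CUDA device available"
    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    deadline = time.time() + seconds
    iters = 0
    while time.time() < deadline:
        a = a @ b                  # keep the SMs busy
        torch.cuda.synchronize()   # force the queued work to actually run
        iters += 1
    return f"completed {iters} matmul iterations without error"

print(stress(seconds=60))
```

A healthy card should run this for many minutes without CUDA errors; crashes or resets under load would point to a hardware/thermal problem.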
Is there a GPU stress test you recommend? I am a bit worried that this computer, which was thrown together, may not have the capacity to handle a lot of heat. But I guess I can monitor that for a while. Also note that I have my GPU turned off from graphics duties (it’s not driving my X windows) but available for calculation tasks.
So far, the system with the 435 driver seems much more stable, though I still run into problems, just less frequently. After a couple days of use, I had to reboot to get CUDA back again (same problem as in my initial post here). Also, after starting/stopping the debugger (with a suspend thrown in), the debugger says no GPU is free, even when I open an ipython console in the debugger. However, when I open ipython from the terminal, it sees the GPU. A PyCharm problem, I guess.
What do the temperature sensors report?
Are you seeing high temperatures on your GPU or the system in general?
Hi ptrblck: If you’re thinking that heat may be why I’m having these problems, it’s not. I’m very slowly stepping through PyTorch’s transformer implementation using PyCharm, not running code, at least on my work computer, which has had the problem I’ve described. (I worry about heat on the system because I ran an HLTA analysis that took 3 days and the sensors reported momentary temperatures above 90C, where 100C is a critical temperature for my chips according to lm-sensors. The run was not on the GPU and encountered no errors.)
There’s also another reason to think the problem I’ve reported is not due to hardware. My home computer, which is entirely different hardware (different manufacturer, chipset, GPU), is encountering the same problem I reported here. So: two different hardware setups, the same software setup, the same error, which suggests a software issue. I’m going to do some experimenting with suspend, because it seems that I run into this problem after waking from suspend.
Thanks for the update.
A software issue related to the suspend mode on two different machines sounds quite unlucky, but might of course be the issue.
Just out of curiosity, which OS are you using?
I’m using up-to-date Ubuntu 18.04.
Yesterday I didn’t put my work computer to suspend and didn’t run into any CUDA problem all day (I should have suspended at the end of the day to see its effect, but got caught up in something else). I rebooted my home computer and also didn’t run into any CUDA problem. I suspended it overnight, and this morning CUDA was no longer accessible via Python. It may be worth noting that the GPU was still being used by Xorg and programs running on Xorg (my home computer uses the GPU for video; my work computer does not).
Will continue to test today. If this is a suspend issue, is there anywhere in particular I should report it?
You could try to reload the nvidia kernel module via:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
Ubuntu seems to have some issues with sleep/suspend (or maybe Linux in general?).
While I never suspend my workstation, my laptop isn’t able to connect via VPN after waking up.
Not sure where to report it.
I can confirm that the problem only seems to occur, and occurs fairly reliably, when I suspend and resume with PyCharm active and perhaps my debugger on. I just tried twice to cause CUDA to destabilize without PyCharm on, suspending/resuming and just using ipython to check for availability of CUDA. Encountered no problem. Then I tried to destabilize CUDA by having PyCharm on but not debugging. Also encountered no problem. Later today, I’ll try again, but with the debugger running. I suspect that’s when it’ll fail, which should give me a good way to avoid breakdowns.
Also, I tried rmmod nvidia_uvm (good suggestion; makes sense). Unfortunately, it gives an error message saying nvidia_uvm is in use. I tried a variety of things, including removing other nvidia kernel modules, but whatever I do, nvidia_uvm ‘is in use.’ I have to have ‘sudo prime-select nvidia’ in place or otherwise CUDA is inaccessible, but the moment I run ‘sudo prime-select nvidia’, all the nvidia modules load. I can go back to ‘sudo prime-select intel’, but it takes a reboot to have any effect.
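For what it’s worth, the ‘in use’ refusal just means the module’s reference count is nonzero; lsmod (which reads /proc/modules) shows that count, and something like `sudo fuser -v /dev/nvidia*` should list the processes holding the device files. A small sketch of reading the use count (the sample line below is made up for illustration):

```python
# Parse /proc/modules (the data behind `lsmod`) to see a module's use count.
# A nonzero count on nvidia_uvm explains why `rmmod` refuses to unload it.
def module_use_counts(text):
    """Map module name -> (use_count, dependent modules)."""
    mods = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        name, count, used_by = fields[0], int(fields[2]), fields[3]
        deps = [] if used_by == "-" else [d for d in used_by.split(",") if d]
        mods[name] = (count, deps)
    return mods

# Illustrative sample line (format: name size refcount used_by state address):
sample = "nvidia_uvm 1011712 2 - Live 0x0000000000000000"
print(module_use_counts(sample))   # {'nvidia_uvm': (2, [])}
```

Killing whatever process holds the count (often a Python interpreter or debugger with a live CUDA context) should let rmmod succeed.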
Yup, CUDA remains much more stably accessible when the PyCharm debugger is terminated before suspending a machine. I can use it all day with multiple suspends. Overnight CUDA did crash and I thought I had terminated the debugger but perhaps not.
Anyway, a workaround to the problem, most of the time, seems to be to terminate the PyCharm debugger before suspending, otherwise CUDA will almost certainly become inaccessible (at least on my Ubuntu 18.04 system).
For a full solution, something needs to be fixed in PyTorch (maybe Python?) or in CUDA. I just ran the transformer tutorial code in Python directly, w/o PyCharm. During the run, I momentarily suspended the linux system and then woke it. Immediately I got errors about CUDA and now CUDA is inaccessible in python, ipython, etc.
My best bet would still be the weird interactions I see between Linux (Ubuntu?) suspend and a lot of drivers. Googling for issues with suspend yields a lot of this “undefined behavior”.
I solved this problem with your method. Thank you so much.
This worked for me. I’ve also had issues with suspend.
Error : RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
RTX 2080 Super
For some reason, having a local jupyter notebook running uses the gpu in a mysterious way that it doesn’t like. I run most of my stuff on .py’s. The workaround is not so nice. Just have to launch a docker to use jupyter notebook which is definitely a little less than ideal.
Hello, I ended up with this solution, no reboot required:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
After executing such commands, I can use PyTorch again.
Is there any way I could fix this without crashing in the first place? I actually run some PyTorch scripts and then suspend the laptop if I have to move to another place, but unfortunately, the process crashes due to the same error and I have to re-run the entire script again.
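Until the underlying suspend issue is fixed, one mitigation is to checkpoint periodically so a CUDA crash only costs you the work since the last save rather than the whole run. A minimal sketch; the model, optimizer, and path here are placeholders:

```python
# Periodic checkpointing sketch so a CUDA crash after suspend doesn't
# force a full re-run. Model, optimizer, and path below are stand-ins.
import os
import tempfile
import torch

def save_checkpoint(path, model, optimizer, epoch):
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optim_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optim_state"])
    return ckpt["epoch"] + 1   # resume from the next epoch

# Toy usage: save after epoch 3, then resume at epoch 4.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
path = os.path.join(tempfile.gettempdir(), "ckpt_demo.pt")
save_checkpoint(path, model, opt, epoch=3)
start_epoch = load_checkpoint(path, model, opt)
print(start_epoch)   # 4
```

Loading with `map_location="cpu"` also means you can restore the checkpoint even while CUDA is in its broken post-suspend state, reload the module, and then move the model back to the GPU.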
This worked! Thank you so much