CUDA installation cryptic errors

I’m trying to set up PyTorch 1.1 with CUDA, but I’m encountering seemingly random errors that are difficult to debug. After installation, when I run a minimal snippet to test CUDA functionality:

import torch
torch.zeros(5).cuda()

I get random cryptic errors, one example being

fatal   : Memory allocation failure
fatal   : Memory allocation failure
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: unknown error

or a CUDA out-of-memory error (even though I’m only trying to copy a tiny tensor), or even a segmentation fault that crashes Python.

To give some context, I’m using a conda environment with Python 3.7, and I have a CUDA 9.0 installation with cuDNN 7.5.1. I install PyTorch with the command from the website:

conda install pytorch torchvision cudatoolkit=9.0 -c pytorch

I have also tried PyTorch 0.4 and CUDA 10.0/10.1, but the errors persist. The installed driver version is 418.74. Finally, I do not have root access, since I’m working on a shared server; for this reason CUDA is installed in a custom directory.

The PyTorch binaries ship with their own CUDA and cuDNN runtimes, so you would just need to install the NVIDIA driver on your machine.
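
As a quick check, something like this should print the CUDA and cuDNN versions the installed binary ships with (just a minimal sketch):

import torch

# CUDA runtime the binary was built against (independent of the local toolkit)
print(torch.version.cuda)
# cuDNN version bundled with the binary
print(torch.backends.cudnn.version())
# whether the driver currently exposes a usable GPU to PyTorch
print(torch.cuda.is_available())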

Which GPU(s) are you using?
Does nvidia-smi run successfully?
Did you update the driver without restarting?

Thank you for the reply.

The GPUs I tested were a GTX TITAN X, a TITAN Xp, and a Tesla K40, and all three produced the same errors, though on the Xp Python hangs indefinitely instead of printing an explicit error message.

In all cases nvidia-smi runs without issues. The drivers were not installed by me (as I do not have root access), but I believe they work fine, since other users of the server have not reported any problems. However, the drivers were updated by the admin some time ago, just before my problems started. I clean-installed every component I could, but could not make any progress.

Apparently clearing the cache at ~/.nv can be relevant in some cases after a driver change, but it did not help for me. I wonder if PyTorch keeps other driver-related cache files somewhere that might be causing a mismatch.
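
For reference, clearing it amounted to roughly the following (just a small sketch; I’m assuming the JIT/compute cache only lives under ~/.nv):

from pathlib import Path
import shutil

# default location of the CUDA JIT/compute cache (my assumption)
cache = Path.home() / ".nv"
if cache.exists():
    print("clearing", cache)
    shutil.rmtree(cache, ignore_errors=True)
else:
    print(cache, "does not exist")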

The servers run Debian and I connect via ssh, if that might be relevant.

That’s really weird.
Does your system have docker and nvidia-docker on it?
If so, could you try to run the PyTorch Docker container as explained here?
I would like to create a clean environment and see if that works.

Unfortunately the system does not have Docker installed, and I’m not sure there is another way to test this without root access. I also tried a clean install of Python using pyenv and installed torch via pip to see if it was a conda-related issue, but this did not work either.

Could you also try to create a clean conda environment and reinstall PyTorch there?

If that doesn’t help and CUDA is installed on the machine, could you try to build PyTorch from source?

Are you successfully running any CUDA code on this machine?

I tried deleting Miniconda entirely, reinstalling it, creating a new empty environment (just in case), and installing PyTorch with the command from the website. Unfortunately, I ran into the exact same issue once again.

On the other hand, I can successfully compile and run the CUDA samples distributed with version 9.0.176, so CUDA and nvcc seem to be working and configured properly, even though I understand the local toolkit is not used by the prebuilt PyTorch binaries.

I will shortly try to compile PyTorch from scratch and report the results.

So I tried to compile PyTorch from source with CUDA support. I installed the CUDA 9.2 toolkit locally, configured the environment variables, and built and installed PyTorch into a clean conda environment (following the instructions in the PyTorch repo).

Unfortunately, this did not work either. CMake successfully detects CUDA and cuDNN during compilation. Weirdly enough, after installation, running

import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.backends.cudnn.enabled)
print()

torch.zeros(5).cuda()

prints

False
9.2.148
True

THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=51 error=2 : out of memory
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yardima/miniconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (2) : out of memory at ../aten/src/THC/THCGeneral.cpp:51

What could be wrong here? The CUDA 9.2 samples still compile and run cleanly, but PyTorch does not work.

As a side note, I tried compiling with CUDA 9.0, but CMake complained that GCC versions above 6 are incompatible with CUDA versions below 9.2. The server I use has GCC 6.3.0 but no GCC 5. Is there a way to get around this without going through the hassle of compiling GCC 5 myself, given that I do not have root privileges?

Could you check the memory usage in nvidia-smi and see if the GPU is used by other processes or if some dead processes are filling it up?
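
Something like this should list the memory usage and the processes currently holding memory on each GPU (just a quick sketch wrapping nvidia-smi; plain nvidia-smi output works as well):

import subprocess

# per-GPU memory usage
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True).stdout)

# compute processes currently holding GPU memory
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
    capture_output=True, text=True).stdout)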

I’m not sure if there is a simple way of changing gcc in your current setup.
The cleanest way would probably still be to use docker, since you can just wipe it in the worst case.

Checking nvidia-smi shows that the server is almost entirely free. Of the 8 GPUs available, one is assigned to me (via CUDA_VISIBLE_DEVICES), and it is always unused. The problem persists even when the server usage is zero.

I also realised that although Docker is not supported, the server does support Singularity containers; maybe that could help?

Could you check what torch.cuda.device_count() returns?
Are you seeing more than a single GPU and might you accidentally be using the wrong one?
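
Something along these lines would show both at once (just a quick sketch):

import os
import torch

# what the environment restricts visibility to
print(os.environ.get("CUDA_VISIBLE_DEVICES"))
# how many devices PyTorch sees, and which one would be used
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.current_device(), torch.cuda.get_device_name(0))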

I’m not familiar with Singularity containers, but you could give it a try.

Interestingly, even though CUDA_VISIBLE_DEVICES is set correctly, torch.cuda.device_count() returns 0.

Could you skip setting CUDA_VISIBLE_DEVICES and check the returned counts then?
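
E.g. you could drop the variable inside the process before CUDA is initialized (a minimal sketch):

import os

# remove the restriction for this process only
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

import torch
print(torch.cuda.device_count())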

It still returns zero, even if I do not set CUDA_VISIBLE_DEVICES.

Does torch.cuda.is_available() also return False?
Could you check if the flag is set in your .bashrc or somewhere else by running
echo $CUDA_VISIBLE_DEVICES before running your script?

As I mentioned, running the minimal script

import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.backends.cudnn.enabled)

prints

False
9.2.148
True

I double-checked the CUDA_VISIBLE_DEVICES flag, but it is not set explicitly anywhere before the PyTorch code runs.

I’m not sure what’s going on and would personally try the brute-force approach (with a quick sanity check after each install, as sketched after the list):

  • create a new conda env for each run and try to install all current binaries
  • start from the nightly build with CUDA10, then go down to CUDA9
  • end with the current stable version
  • try the pip versions next
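
After each install, a quick check like this (just a rough sketch) should confirm which build you actually got and whether CUDA comes up:

import torch

print(torch.__version__)          # stable vs. nightly build string
print(torch.version.cuda)         # CUDA version the binary was built with
print(torch.cuda.is_available())  # whether CUDA can actually be used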

Also, could you post the build log you got while building from source?