Alternating CUDA errors: "invalid device function" and "CUDNN_STATUS_NOT_INITIALIZED"

I finally got my code working on my local CPU and wanted to begin training in earnest on a cloud (AWS) machine with a GPU. I thought switching from CPU to GPU would be relatively simple but, wow, was I wrong.

To the best of my understanding, I have the proper versions of PyTorch and CUDA running…

%> nvidia-smi
Wed Mar 24 18:53:53 2021       
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P0    26W /  70W |   2251MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
__Python VERSION: 3.6.8 (default, Nov 16 2020, 16:55:22) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
__pyTorch VERSION: 1.8.0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
__Number CUDA Devices: 1
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0

I’ve reached a situation where, without changing anything in the code, I get a different error on each alternate run. The first attempt fails with:

/dev/lib64/python3.6/site-packages/torch/nn/modules/ in _conv_forward(self, input, weight, bias)
    394                             _pair(0), self.dilation, self.groups)
    395         return F.conv2d(input, weight, bias, self.stride,
--> 396                         self.padding, self.dilation, self.groups)
    398     def forward(self, input: Tensor) -> Tensor:

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED


Simply running again will then produce:

     38         for x_ims, x, y in dataloader:
---> 39             x_ims = x_ims.to(DEVICE)
     40             x = x.to(DEVICE)
     41             y = y.to(DEVICE)

RuntimeError: CUDA error: invalid device function

Alternate runs will switch back and forth between these errors.

When I then go through and enter just the lines where the problem arose, I get no error at all. So I’m utterly confused. This is my first attempt at using a GPU for computation and, to be honest, I’m a little daunted by all the management that seems to be required.

You might be running into this error.
If that’s the case (PyTorch 1.8.0 pip wheels with the CUDA 10.2 runtime on a Turing GPU), then install the nightly release, the CUDA 11.1 pip wheel, or any conda binary to fix it.
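For context on why the runs fail: the pip wheels only bundle GPU kernels for the architectures they were compiled for, and a Turing card such as the Tesla T4 has compute capability 7.5 (sm_75). If the wheel’s kernel list doesn’t include that architecture, CUDA ops fail with errors like the ones above. On a working install you can compare torch.cuda.get_arch_list() against torch.cuda.get_device_capability(); the sketch below uses made-up arch lists purely for illustration, not the contents of any specific wheel.

```python
# Sketch of the compatibility check. The arch lists here are illustrative
# assumptions, not taken from a real wheel; on a real install you would use
# torch.cuda.get_arch_list() and torch.cuda.get_device_capability().

def supports_gpu(arch_list, compute_capability):
    """Return True if the wheel's kernel arch list covers the GPU."""
    major, minor = compute_capability
    return f"sm_{major}{minor}" in arch_list

# Hypothetical kernel lists for two wheels:
old_wheel = ["sm_37", "sm_50", "sm_60", "sm_70"]                      # no Turing
new_wheel = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80"]    # has Turing

t4 = (7, 5)  # Tesla T4 is a Turing GPU, compute capability 7.5
print(supports_gpu(old_wheel, t4))  # False -> kernels missing at runtime
print(supports_gpu(new_wheel, t4))  # True
```

If the check comes back False, no amount of driver fiddling helps; the fix is installing a build whose kernel list covers your card.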

I haven’t been using conda and would prefer not to complicate things by adding another package manager if I don’t have to. I’ve been using pip, and I’m concerned the two together may introduce further conflicts (though I really don’t know).

Just so I’m clear: without using conda, you’re suggesting I install either the nightly release of PyTorch or the CUDA 11.1 pip wheel? I’m afraid I don’t entirely follow what installing the “CUDA 11.1 pip wheel” means. Still learning the ins and outs of this. Does that mean I would use pip to install CUDA 11.1? Do I have to remove my current version first, or will pip take care of that?

Thank you in advance.

I would recommend uninstalling the current pip wheel via pip uninstall torch and then installing either the nightly from here (select “Preview (Nightly)”) with CUDA 10.2 or 11.1, or the “Stable (1.8.0)” release with CUDA 11.1.

The pip wheels and conda binaries ship with their own CUDA runtime (as well as cudnn, NCCL, etc.). You won’t get the complete CUDA toolkit, but you can use PyTorch on the GPU with it (except for compiling custom CUDA extensions, building PyTorch from source, etc.).
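Concretely, the reinstall might look like the sketch below. The exact package pins and index URL come from the selector on the PyTorch “Get Started” page, so treat these commands as illustrative and verify them against the current install page before running.

```shell
# Remove the wheel that bundles the CUDA 10.2 runtime
pip uninstall torch torchvision

# Install the 1.8.0 build with the CUDA 11.1 runtime
# (command pattern from the selector; versions are assumptions)
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 \
    -f

# Sanity check: this should report the bundled runtime (e.g. 11.1)
python -c "import torch; print(torch.version.cuda)"
```

Because the wheel carries its own CUDA runtime, the system-wide nvcc/toolkit version from nvidia-smi does not need to match it; only the driver has to be new enough.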

Well, I’ve done as you suggested. Thank you. I even tried going back to scratch, removing Python and reinstalling, and after a very large effort in untangling the various libraries, things are up and running again, this time with CUDA 11.1.

I can’t thank you enough for pointing me in the right direction. Things are starting to run, so I’m on to the next phase of wrestling with memory and deciding when and where to pull things over to the CPU. But things are looking up. I’m sure there is a very good reason for all the juggling of data from one device to another but, as a beginner, it’s difficult to be forced into that level of control. I’m sure one day it will all just make sense.

Thanks again.