RuntimeError: cuda runtime error (8): invalid device function at /opt/conda/conda-bld/pytorch_1503963423183/work/torch/lib/THC/THCTensorCopy.cu:204

I have some code that I wanted to run on the GPU to speed up the computations. For this purpose I installed the NVIDIA driver, the CUDA toolkit, and cuDNN.

These are the properties of my system:

__Python VERSION: 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
__pyTorch VERSION: 0.2.0_4
__CUDA VERSION: v9.0.176
__CUDNN VERSION: 6021
__Number CUDA Devices: 1
__Devices
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, GeForce GT 425M, 384.130, 964 MiB, 195 MiB, 769 MiB
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0

And this is the part of my code that uses the GPU:

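    import torch
    import torch.nn as nn

    # my_rnn_model is assumed to be defined earlier in the script.
    # Wrap the model in DataParallel only when more than one GPU is present.
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # dim = 0 [33, xxx] -> [11, ...], [11, ...], [11, ...] on 3 GPUs
        my_rnn_model = nn.DataParallel(my_rnn_model)

    # Move the model parameters to the GPU if CUDA reports a usable device.
    if torch.cuda.is_available():
        print("torch.cuda.is_available() is:", torch.cuda.is_available())
        my_rnn_model.cuda()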

But I received a CUDNN_STATUS_ARCH_MISMATCH error.
I googled this error and found that cuDNN needs a GPU with CUDA compute capability 3.0 or higher, but I think mine is 2.1. So I concluded that the problem is my hardware and that I cannot use the GPU.
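
The capability comparison described above can be sketched in a few lines. This is only an illustration: `meets_min_capability` and `describe_gpu_support` are hypothetical helper names, the 3.0 minimum is the figure quoted above, and `torch.cuda.get_device_capability` is assumed to be present in the installed PyTorch (the code falls back to a message if it is not).

```python
def meets_min_capability(capability, minimum=(3, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def describe_gpu_support(minimum=(3, 0)):
    """Report whether GPU 0 meets `minimum`, as a human-readable string."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device is visible to PyTorch"
    # Assumption: this function may be missing in very old PyTorch releases.
    get_capability = getattr(torch.cuda, "get_device_capability", None)
    if get_capability is None:
        return "this PyTorch version cannot report the compute capability"
    capability = get_capability(0)
    verdict = "meets" if meets_min_capability(capability, minimum) else "is below"
    return "compute capability {}.{} {} the required {}.{}".format(
        capability[0], capability[1], verdict, minimum[0], minimum[1])
```

Calling `print(describe_gpu_support())` prints the verdict for GPU 0. For a capability of 2.1, `meets_min_capability((2, 1))` is False.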

I decided to go back to running my code on the CPU (without the GPU), so I removed the part of my code that calls my_rnn_model.cuda() when CUDA is available.

I expected the code to run as it did before I installed CUDA, but it does not. Now I receive this error:
RuntimeError: cuda runtime error (8): invalid device function at /opt/conda/conda-bld/pytorch_1503963423183/work/torch/lib/THC/THCTensorCopy.cu:204
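
One possible way to check whether some remaining code path still touches CUDA is to hide the GPU from the process before PyTorch is imported. The sketch below assumes the standard `CUDA_VISIBLE_DEVICES` environment variable is honored on this setup; `force_cpu_only` is a hypothetical helper name, not part of PyTorch.

```python
import os


def force_cpu_only(environ=os.environ):
    """Hide all CUDA devices from libraries that are imported afterwards."""
    # An empty value means no device is exposed to the CUDA runtime.
    environ["CUDA_VISIBLE_DEVICES"] = ""
    return environ


# Must run before `import torch`, otherwise the setting may come too late.
force_cpu_only()
```

If the error disappears with the GPU hidden, something in the script is probably still creating or copying CUDA tensors. The same effect can be had by setting the variable in the shell before starting Python.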

Do you have any suggestions for how I can fix it?
Thanks.

I am also encountering this problem, with a GT 650M, CUDA 8.0, and PyTorch 0.4.0.

I would guess that PyTorch 0.4.0 might be built for a minimum compute capability of 3.5, while your device has 3.0.
Take a look at the stack trace and try to check which operation or function call fails.
Once you’ve isolated it, check which compute capability introduced it.
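
A rough way to isolate the failing call is to run a few small operations one by one and record which ones raise. This is a sketch only: `find_failing_steps` and `cuda_smoke_steps` are hypothetical helpers, and the three probes are arbitrary examples that should be replaced with the operations your model actually uses.

```python
def find_failing_steps(steps):
    """Run each (name, callable) pair and collect the ones that raise.

    Returns a list of (name, error message) tuples in execution order.
    """
    failures = []
    for name, fn in steps:
        try:
            fn()
        except Exception as exc:
            failures.append((name, "{}: {}".format(type(exc).__name__, exc)))
    return failures


def cuda_smoke_steps():
    """Example probes (assumed, not exhaustive): small tensor ops on the GPU."""
    import torch
    return [
        ("copy to GPU", lambda: torch.ones(4).cuda()),
        ("add on GPU", lambda: torch.ones(4).cuda() + 1),
        ("matmul on GPU",
         lambda: torch.ones(4, 4).cuda().mm(torch.ones(4, 4).cuda())),
    ]
```

Running `find_failing_steps(cuda_smoke_steps())` on the affected machine returns the names of the probes that fail together with their error messages, which narrows down where to look in the stack trace.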