Facing Issue While Installing PyTorch on CUDA Version 11.4

Hello,
I am trying to install PyTorch on an AWS EC2 instance, but I am not able to access the GPU.
EC2 Instance Details:
Instance Name: Deep Learning AMI GPU CUDA 11.4.3 (Amazon Linux 2)
Instance Type: t2.xlarge
NVIDIA driver version: 510.47.03
CUDA version: 11.4

(base) [ec2-user@ip-XXXXX ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0

GPU Details:
(base) [ec2-user@ip-XXXXX ~]$ lspci | grep VGA
00:02.0 VGA compatible controller: Cirrus Logic GD 5446

(test) [ec2-user@ip-XXXXX ~]$ python
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.cuda.is_available()
False

Thanks in advance for any help…


How did you install PyTorch?
As no CUDA runtime is available, I would guess you’ve installed the CPU-only binaries?
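
A minimal sketch of how to check which binaries are installed (assuming a standard pip or conda install; the "+cu113"/"+cpu" suffixes in the comments are how the pip wheels are typically tagged):

import torch

# CPU-only builds report no CUDA runtime; CUDA 11.3 pip wheels carry a "+cu113" suffix.
print(torch.__version__)          # e.g. "1.12.1+cpu" vs. "1.12.1+cu113"
print(torch.version.cuda)         # None for CPU-only builds, "11.3" for cu113 wheels
print(torch.cuda.is_available())  # needs a working driver and a visible GPU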

I am installing PyTorch using the command below:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

What is torch.version.cuda returning?

>>> torch.version.cuda
'11.3'
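
For completeness, a short sketch of checks that help separate missing CUDA binaries from a broken driver (assuming the cu113 wheels are installed as above):

import torch

# The wheel ships its own CUDA 11.3 runtime, so torch.version.cuda is set even
# when the driver is unusable; device_count() then still comes back as 0.
print(torch.version.cuda)         # "11.3" -> CUDA-enabled binaries are installed
print(torch.cuda.device_count())  # 0 when PyTorch cannot reach a GPU through the driver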

In that case it seems your setup has trouble communicating with the GPU, so maybe try a plain NVIDIA CUDA Docker container, install the binaries there, and see if it can find the GPU(s).
Alternatively, also try to run any other CUDA application in your current setup and see if the device can be used.
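
A minimal, hypothetical driver check outside of PyTorch could look like this (a sketch, assuming libcuda.so.1 is on the library path; any return value other than 0 means the driver cannot be initialized):

import ctypes

# Load the CUDA driver library directly and try to initialize it.
# ctypes.CDLL raises OSError if libcuda.so.1 cannot be found at all.
libcuda = ctypes.CDLL("libcuda.so.1")
status = libcuda.cuInit(0)
print("cuInit returned", status)  # 0 == CUDA_SUCCESS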

I have not installed CUDA myself; it comes preinstalled on the AWS AMI.
When I run nvidia-smi, I get this response:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This indeed sounds like a setup issue, as the driver seems to be in a bad state.
Could you restart the node or lease another one to check if this solves the issue? Once nvidia-smi is able to communicate with the driver again, try to run any CUDA sample and then a PyTorch application on the GPU.
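
Once nvidia-smi works again, a small PyTorch smoke test on the GPU might look like this (a sketch, assuming the cu113 wheels stay installed):

import torch

# Allocate a tensor on the GPU and run a simple matmul; this raises a RuntimeError
# if PyTorch still cannot communicate with the driver or find a device.
assert torch.cuda.is_available(), "CUDA is still not available"
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("GPU OK:", torch.cuda.get_device_name(0), y.sum().item())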