Facing Issue While Installing PyTorch on CUDA Version 11.4

Hello,
I am trying to install PyTorch on an AWS EC2 instance, but I am not able to access the GPU.
EC2 Instance Details:
Instance Name: Deep Learning AMI GPU CUDA 11.4.3 (Amazon Linux 2)
Instance Type: t2.xlarge
NVIDIA driver version: 510.47.03
CUDA version: 11.4

(base) [ec2-user@ip-XXXXX ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0

GPU Details:
(base) [ec2-user@ip-XXXXX ~]$ lspci | grep VGA
00:02.0 VGA compatible controller: Cirrus Logic GD 5446

(test) [ec2-user@ip-XXXXX ~]$ python
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.cuda.is_available()
False

Thanks in advance for any help…


How did you install PyTorch?
As no CUDA runtime is available, I would guess you’ve installed the CPU-only binaries?
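
A minimal sketch of how to check which binaries are installed (assuming a standard pip or conda install; the "+cu113"/"+cpu" suffixes in the comments are how the pip wheels are typically tagged):

import torch

# CPU-only builds report no CUDA runtime; CUDA 11.3 pip wheels carry a "+cu113" suffix.
print(torch.__version__)          # e.g. "1.12.1+cpu" vs. "1.12.1+cu113"
print(torch.version.cuda)         # None for CPU-only builds, "11.3" for cu113 wheels
print(torch.cuda.is_available())  # needs a working driver and a visible GPU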

I am installing PyTorch using the command below:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

What is torch.version.cuda returning?

>>> torch.version.cuda
'11.3'
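
For completeness, a short sketch of checks that help separate missing CUDA binaries from a broken driver (assuming the cu113 wheels are installed as above):

import torch

# The wheel ships its own CUDA 11.3 runtime, so torch.version.cuda is set even
# when the driver is unusable; device_count() then still comes back as 0.
print(torch.version.cuda)         # "11.3" -> CUDA-enabled binaries are installed
print(torch.cuda.device_count())  # 0 when PyTorch cannot reach a GPU through the driver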

In that case it seems your setup has trouble communicating with the GPU, so maybe try a plain NVIDIA CUDA Docker container, install the binaries there, and see if it can find the GPU(s).
Alternatively, also try to run any other CUDA application in your current setup and see if the device can be used.
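
A minimal, hypothetical driver check outside of PyTorch could look like this (a sketch, assuming libcuda.so.1 is on the library path; any return value other than 0 means the driver cannot be initialized):

import ctypes

# Load the CUDA driver library directly and try to initialize it.
# ctypes.CDLL raises OSError if libcuda.so.1 cannot be found at all.
libcuda = ctypes.CDLL("libcuda.so.1")
status = libcuda.cuInit(0)
print("cuInit returned", status)  # 0 == CUDA_SUCCESS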

I have not installed CUDA myself; it comes preinstalled on the AWS AMI.
When I run nvidia-smi, I get this response:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This indeed sounds like a setup issue, as the driver seems to be in a bad state.
Could you restart the node or lease another one to check if this solves the issue? Once nvidia-smi is able to communicate with the driver again, try to run any CUDA sample and then a PyTorch application on the GPU.
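
Once nvidia-smi works again, a small PyTorch smoke test on the GPU might look like this (a sketch, assuming the cu113 wheels stay installed):

import torch

# Allocate a tensor on the GPU and run a simple matmul; this raises a RuntimeError
# if PyTorch still cannot communicate with the driver or find a device.
assert torch.cuda.is_available(), "CUDA is still not available"
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("GPU OK:", torch.cuda.get_device_name(0), y.sum().item())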