GPU not available with CUDA 12.1 on Amazon V100 GPU instance

Hi, I am trying to set up PyTorch on an Amazon V100 GPU instance with CUDA 12.1.

The PyTorch download page (Start Locally | PyTorch) didn’t have a build for CUDA 12.1, so I used:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

as this discussion suggests: Install pytorch with Cuda 12.1
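
To confirm that conda pulled a CUDA build of PyTorch rather than a CPU-only package, one rough check is to inspect the resolved packages (the exact package names may vary, so this is just a sketch):

# list the torch- and cuda-related packages conda actually installed
conda list | grep -iE "torch|cuda"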

I have the NVIDIA driver configured correctly on my instance:

% nvidia-smi
Wed Mar 15 16:40:19 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB            Off| 00000000:00:1E.0 Off |                    0 |
| N/A   43C    P0               43W / 300W|      0MiB / 16384MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

and:

% nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

But after installing PyTorch with the above command, I get:

>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.is_available()
False
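
A slightly more verbose check from the same environment, in case the extra output is useful (just a sketch of what I can run):

# print the installed torch version, the CUDA version it was built with,
# and what the runtime currently sees
python -c "import torch; print(torch.__version__, torch.version.cuda); print('is_available:', torch.cuda.is_available()); print('device_count:', torch.cuda.device_count())"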

Any help would be great,
Thanks

Are you able to run any other CUDA application on this system and in your environment?

Thanks @ptrblck_de, what would be the simplest way to check this?
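
One option I can think of is compiling and running a minimal CUDA program with the toolkit's nvcc, completely outside of PyTorch (a sketch, not tested yet; the file name is arbitrary):

# write a tiny program that only queries the number of visible CUDA devices
cat > cuda_check.cu <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    // Ask the CUDA runtime how many devices it can see.
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices visible: %d\n", n);
    return 0;
}
EOF

# compile with the CUDA 12.1 toolchain and run it
nvcc cuda_check.cu -o cuda_check && ./cuda_check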

So I tried to check whether the environment can run a PyTorch GPU Docker image:

docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.02-py3

Status: Downloaded newer image for nvcr.io/nvidia/pytorch:23.02-py3
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Looks like a CUDA problem?
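
In case it is relevant, a quick way to see which runtimes Docker knows about (just a check, not a fix):

# if an NVIDIA runtime has been registered with Docker it should be listed here
docker info | grep -i runtime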

It looks more like a driver problem. What does nvidia-smi return on the bare-metal machine (without running any Docker containers)?

% nvidia-smi
Wed Mar 15 16:40:19 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB            Off| 00000000:00:1E.0 Off |                    0 |
| N/A   43C    P0               43W / 300W|      0MiB / 16384MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Could you check if the correct nvidia-container-toolkit is installed?
You could take a look at this issue, which describes the same problem and the fixes.
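
For reference, the usual fix follows NVIDIA's container toolkit install guide; roughly (the distribution-specific repository setup is omitted here, so treat this as a sketch):

# install the container toolkit, register the NVIDIA runtime with Docker, restart the daemon
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

After that, the docker run --gpus all ... command from above should be able to find the GPU.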