PyTorch CUDA problem on GPU

Hi PyTorch developers, I am trying to use PyTorch on the GPU; however, it shows:

```
ValueError: invalid literal for int() with base 10: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
```

Additionally, it shows:

```
/home/seis/ret/prog/anaconda3/envs/cpi/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: Download Drivers | NVIDIA Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729066392/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False
```

Please suggest which PyTorch version will support the NVIDIA driver on my system ("found version 9010").

You can take a look at this doc; in particular, Table 3 shows the minimum driver version required for each CUDA release.
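Note that the value in the warning follows the encoding `cudaDriverGetVersion()` uses, `1000 * major + 10 * minor`, so "found version 9010" means your driver supports up to CUDA 9.1. A minimal sketch to decode it and compare against your PyTorch build:

```python
import torch

# cudaDriverGetVersion() encodes the version as 1000 * major + 10 * minor,
# so the "found version 9010" in the warning corresponds to CUDA 9.1.
found = 9010
major, minor = found // 1000, (found % 1000) // 10
print(f"Driver supports up to CUDA {major}.{minor}")

# CUDA toolkit the installed PyTorch binary was built against; it must
# not be newer than what the driver supports.
print("torch.version.cuda:", torch.version.cuda)
```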

Which PyTorch version will be compatible with NVIDIA driver version 9010? Can you please suggest one? The GPU I am using is older, and the engineer doesn't want to upgrade it to the latest NVIDIA driver.

You could check the previous PyTorch versions page to see which PyTorch version will work with the CUDA driver you have.

@Manuel_Alejandro_Dia

```
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=9.2 -c pytorch
```

I think cudatoolkit==9.2 will not work, so I need cudatoolkit==9.1. How do I get it?

Maybe you could try:

```
conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=9.0 -c pytorch
```

I am getting an error again:

```
In [1]: import torch

In [2]: torch.version.cuda
Out[2]: '9.0.176'

In [3]: torch.cuda.is_available()
Out[3]: False
```

Can you post the output of `nvidia-smi` if you are on Linux?

Also, maybe check this Stack Overflow thread.
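Something along these lines (a minimal sketch) would also collect the relevant version information in one place:

```python
import torch

# Gather the version info relevant to CUDA/driver mismatches.
print("PyTorch:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)  # toolkit the binary was compiled against
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device 0:", torch.cuda.get_device_name(0))
```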

Hello,
I have a Windows operating system, where I am using WSL to run Linux-based code. I have NCCL installed, and it is working fine for distributed PyTorch; I have checked it using a Python script.
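(The exact script isn't shown here; a minimal check along these lines, using `torch.distributed` with the NCCL backend, is one way such a test might look. This is a sketch, not the actual script used:)

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU; all-reduce a tensor over NCCL.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # default op is SUM
    print(f"rank {rank}: {t.item()}")  # expect world_size on every rank
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```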

Now I have installed torch.

When I import torch and check torch.cuda.is_available(), I get this error:

```
/home/joy/miniconda3/envs/VideoMae/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at /opt/conda/conda-bld/pytorch_1702400431970/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
```

What would be the solution to this problem?

I am sharing the output of my `nvidia-smi`:

```
Thu Jan 11 10:51:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080        On  | 00000000:18:00.0 Off |                  N/A |
|  0%   46C    P8              14W / 320W |     17MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4080        On  | 00000000:5E:00.0 Off |                  N/A |
|  0%   37C    P8              12W / 320W |      0MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4080        On  | 00000000:B0:00.0 Off |                  N/A |
|  0%   41C    P8               8W / 320W |      0MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4080        On  | 00000000:D8:00.0 Off |                  N/A |
|  0%   43C    P8              14W / 320W |     59MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=======================================================================================|
|    0   N/A  N/A       270      G   /Xwayland                                   N/A     |
|    1   N/A  N/A       270      G   /Xwayland                                   N/A     |
|    2   N/A  N/A       270      G   /Xwayland                                   N/A     |
|    3   N/A  N/A       270      G   /Xwayland                                   N/A     |
+---------------------------------------------------------------------------------------+
```

Could you describe in which setup the PyTorch Distributed use case works fine and why a simple CUDA check fails?
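To narrow it down, you could also probe each GPU in isolation; a minimal sketch (assuming the four GPUs shown in your `nvidia-smi` output):

```python
import os
import subprocess
import sys

# Probe each GPU in a fresh subprocess: if the "Error 2: out of memory"
# warning follows one particular device, that device (or its Xwayland
# usage under WSL) is the likely culprit.
for i in range(4):
    result = subprocess.run(
        [sys.executable, "-c", "import torch; print(torch.cuda.is_available())"],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(i)},
        capture_output=True, text=True,
    )
    print(f"GPU {i}: {result.stdout.strip() or result.stderr.strip()}")
```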

Sorry for the unclear information. Three days ago I installed NCCL on the system to work with two GPUs, and I did all the installation and environment setup. I have 2 GPUs, and I set up the environment on WSL:

```
conda create --name video python=3.8
```

After that I installed CUDA 11.7 and:

```
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
```

It was working fine; torch.distributed was active while running the code.

Yesterday I added 2 more of the same GPUs to the system and restarted it. When checking, NCCL reported that four GPUs are active, but when I activate the previous environment and try to check whether torch can access the GPUs, I get the error above.