PyTorch CUDA problem on GPU

Hi PyTorch developers, I am trying to use PyTorch on the GPU; however, it shows:

```
ValueError: invalid literal for int() with base 10: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
```

Additionally, it shows:

```
/home/seis/ret/prog/anaconda3/envs/cpi/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: Download Drivers | NVIDIA Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729066392/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False
```

Please suggest which PyTorch version will support the NVIDIA driver on my system ("found version 9010").

You can take a look at this doc; in particular, Table 3 shows the minimum driver version required for each CUDA release.
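Note that the value in the warning follows the encoding `cudaDriverGetVersion()` uses, `1000 * major + 10 * minor`, so "found version 9010" means your driver supports up to CUDA 9.1. A minimal sketch to decode it and compare against your PyTorch build:

```python
import torch

# cudaDriverGetVersion() encodes the version as 1000 * major + 10 * minor,
# so the "found version 9010" in the warning corresponds to CUDA 9.1.
found = 9010
major, minor = found // 1000, (found % 1000) // 10
print(f"Driver supports up to CUDA {major}.{minor}")

# CUDA toolkit the installed PyTorch binary was built against; it must
# not be newer than what the driver supports.
print("torch.version.cuda:", torch.version.cuda)
```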

Which PyTorch version will be compatible with NVIDIA driver version 9010? Can you please suggest one? The GPU I am using is older, and the engineer doesn't want to upgrade it to the latest NVIDIA driver.

You could check the previous PyTorch versions page to see which PyTorch version will work with the CUDA driver you have.

@Manuel_Alejandro_Dia

```
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=9.2 -c pytorch
```

I think cudatoolkit==9.2 will not work, so I need cudatoolkit==9.1. How do I get it?

Maybe you could try:

```
conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=9.0 -c pytorch
```

I am getting an error again:

```
In [1]: import torch

In [2]: torch.version.cuda
Out[2]: '9.0.176'

In [3]: torch.cuda.is_available()
Out[3]: False
```

Can you post the output of `nvidia-smi` if you are on Linux?

Also, maybe check this Stack Overflow thread.
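Something along these lines (a minimal sketch) would also collect the relevant version information in one place:

```python
import torch

# Gather the version info relevant to CUDA/driver mismatches.
print("PyTorch:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)  # toolkit the binary was compiled against
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device 0:", torch.cuda.get_device_name(0))
```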

Hello,
I have a Windows operating system, where I am using WSL to run Linux-based code. I have NCCL installed, and it is working fine for distributed PyTorch; I have checked it using a Python script.
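(The exact script isn't shown here; a minimal check along these lines, using `torch.distributed` with the NCCL backend, is one way such a test might look. This is a sketch, not the actual script used:)

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU; all-reduce a tensor over NCCL.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # default op is SUM
    print(f"rank {rank}: {t.item()}")  # expect world_size on every rank
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```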

Now I have installed torch.

When I import torch and check torch.cuda.is_available(), I get this error:

```
/home/joy/miniconda3/envs/VideoMae/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at /opt/conda/conda-bld/pytorch_1702400431970/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
```

What would be the solution to this problem?

I am sharing the output of my `nvidia-smi`:

```
Thu Jan 11 10:51:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080        On  | 00000000:18:00.0 Off |                  N/A |
|  0%   46C    P8              14W / 320W |     17MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4080        On  | 00000000:5E:00.0 Off |                  N/A |
|  0%   37C    P8              12W / 320W |      0MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4080        On  | 00000000:B0:00.0 Off |                  N/A |
|  0%   41C    P8               8W / 320W |      0MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4080        On  | 00000000:D8:00.0 Off |                  N/A |
|  0%   43C    P8              14W / 320W |     59MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=======================================================================================|
|    0   N/A  N/A       270      G   /Xwayland                                   N/A     |
|    1   N/A  N/A       270      G   /Xwayland                                   N/A     |
|    2   N/A  N/A       270      G   /Xwayland                                   N/A     |
|    3   N/A  N/A       270      G   /Xwayland                                   N/A     |
+---------------------------------------------------------------------------------------+
```

Could you describe in which setup the PyTorch Distributed use case works fine and why a simple CUDA check fails?
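To narrow it down, you could also probe each GPU in isolation; a minimal sketch (assuming the four GPUs shown in your `nvidia-smi` output):

```python
import os
import subprocess
import sys

# Probe each GPU in a fresh subprocess: if the "Error 2: out of memory"
# warning follows one particular device, that device (or its Xwayland
# usage under WSL) is the likely culprit.
for i in range(4):
    result = subprocess.run(
        [sys.executable, "-c", "import torch; print(torch.cuda.is_available())"],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(i)},
        capture_output=True, text=True,
    )
    print(f"GPU {i}: {result.stdout.strip() or result.stderr.strip()}")
```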

Sorry for the unclear information. Three days ago I installed NCCL on the system to work with two GPUs, and I did all the installation and environment setup. I have 2 GPUs, and I set up the environment on WSL:

```
conda create --name video python=3.8
```

After that I installed CUDA 11.7 and:

```
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
```

It was working fine; torch.distributed was active while running the code.

Yesterday I added 2 more of the same GPUs to the system and restarted it. When checking, NCCL reported that four GPUs are active, but when I activate the previous environment and try to check whether torch can access the GPUs, I get the error above.