Torch is not able to use GPU; (Ubuntu)

Hi! I just migrated to Ubuntu on my Asus TUF laptop and am having difficulty getting Stable Diffusion (Automatic1111 repo) up and running because PyTorch is not able to use my GPU.

My setup:
I installed the NVIDIA drivers via apt (I also tried 525, which didn’t work either). Currently nvidia-smi gives me:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P8     1W /  N/A |    203MiB /  4096MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1692      G   /usr/lib/xorg/Xorg                 81MiB |
|    0   N/A  N/A      1947      G   /usr/bin/gnome-shell              119MiB |
+-----------------------------------------------------------------------------+

My lshw output:

$ sudo lshw -c display
[sudo] password for sd: 
  *-display                 
       description: VGA compatible controller
       product: TU117M [GeForce GTX 1650 Mobile / Max-Q]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:81 memory:f6000000-f6ffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:f000(size=128) memory:f7000000-f707ffff
  *-display
       description: VGA compatible controller
       product: Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:05:00.0
       logical name: /dev/fb0
       version: c2
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi msix vga_controller bus_master cap_list fb
       configuration: depth=32 driver=amdgpu latency=0 resolution=1920,1080
       resources: irq:24 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:c000(size=256) memory:f7500000-f757ffff

But my output from stable diffusion gives me:

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Commit hash: 226d840e84c5f306350b0681945989b86760e616
Traceback (most recent call last):
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 360, in <module>
    prepare_environment()
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 272, in prepare_environment
    run_python("import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'")
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 129, in run_python
    return run(f'"{python}" -c "{code}"', desc, errdesc)
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 105, in run
    raise RuntimeError(message)
RuntimeError: Error running command.
Command: "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/bin/python3" -c "import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'"
Error code: 1
stdout: <empty>
stderr: /home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at ../c10/hip/HIPFunctions.cpp:110.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

If I open up Python within the venv, CUDA reports that there is one device, but the moment I try to get more info or do anything it errors out:

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice
>>> torch.cuda.is_available()
/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at ../c10/hip/HIPFunctions.cpp:110.)
  return torch._C._cuda_getDeviceCount() > 0
False

I am very lost at this point. I have tried NVIDIA’s PPA drivers as well as the Ubuntu drivers and can’t seem to get any setup that makes it past this point. I appreciate any and all help you can offer. Thank you!

This error:

>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice
>>> torch.cuda.is_available()

indicates that PyTorch fails while trying to initialize your AMD GPU (note the HIP references in the error messages), as it seems you are using an NVIDIA GeForce GTX 1650 alongside another AMD device.
I’m not familiar with your setup, but would it be possible to disable the AMD device while you are running Stable Diffusion on your NVIDIA GPU?
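One way to try that suggestion without touching the BIOS (a sketch, not something tested in this thread) is to hide devices from the GPU runtime via environment variables before launching the webui:

```shell
# Hypothetical workaround: restrict which GPUs the runtime can see.
# A ROCm build of PyTorch honors HIP_VISIBLE_DEVICES; a CUDA build
# honors CUDA_VISIBLE_DEVICES in the same way.
export HIP_VISIBLE_DEVICES=0    # expose only device 0 to the ROCm runtime
python launch.py
```

Which index the NVIDIA card gets (if it is visible to the runtime at all) depends on the installed build, so this may not help if the wrong PyTorch flavor is installed in the first place.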

@ptrblck Thank you for responding! I think you are correct, as when I run pip3 list I get:

torch              1.13.1+rocm5.2
torchvision        0.14.1+rocm5.2

So it looks like the Stable Diffusion script saw my AMD GPU and installed the ROCm build of PyTorch instead of the regular one? I’m going to try again: delete the venv and see if I can find a way to disable the AMD GPU before recreating the environment. Will update here soon.
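As an aside, the build flavor can be read straight off the wheel’s local version suffix shown by pip (e.g. `1.13.1+rocm5.2` vs. `1.13.1+cu117`). A tiny sketch of that check (the function name is made up for illustration):

```python
# Classify a PyTorch wheel by its local version tag, as printed by
# `pip list` or `torch.__version__`, e.g. "1.13.1+rocm5.2" or "1.13.1+cu117".
def torch_backend(version: str) -> str:
    """Return "rocm", "cuda", or "cpu" based on the local version suffix."""
    _, _, local = version.partition("+")   # text after "+", or "" if absent
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    return "cpu"

print(torch_backend("1.13.1+rocm5.2"))  # rocm
print(torch_backend("1.13.1+cu117"))    # cuda
```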

Yes, I think you are right and indeed the ROCm version was installed.
I don’t know how you are installing PyTorch (and other dependencies) in your environment, but maybe it’s possible to pre-install PyTorch with e.g. CUDA 11.7 via pip install torch as described in the install instructions.

Thank you!! I was unable to find where in the code it kept getting ROCm from, but I went into the venv, manually uninstalled torch and torchvision, and then reinstalled, and it works now! Thank you so much!
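For anyone hitting the same issue, the manual fix described above looks roughly like this (a sketch; the version pins assume the torch 1.13.x / CUDA 11.7 wheels that match the pip list output earlier in the thread):

```shell
# Run from the stable-diffusion-webui directory, inside its venv.
source venv/bin/activate
pip uninstall -y torch torchvision
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117

# Sanity check: the version should now show "+cu117" and CUDA should be usable.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```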
