Torch is not able to use GPU; (Ubuntu)

Hi! I just migrated to Ubuntu on my Asus TUF laptop and am having difficulty getting Stable Diffusion (Automatic1111 repo) up and running because PyTorch is not able to use my GPU.

My setup:
I installed the NVIDIA drivers via apt (I also tried 525, which didn’t work either). Currently nvidia-smi gives me:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P8     1W /  N/A |    203MiB /  4096MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1692      G   /usr/lib/xorg/Xorg                 81MiB |
|    0   N/A  N/A      1947      G   /usr/bin/gnome-shell              119MiB |
+-----------------------------------------------------------------------------+

My lshw output:

$ sudo lshw -c display
[sudo] password for sd: 
  *-display                 
       description: VGA compatible controller
       product: TU117M [GeForce GTX 1650 Mobile / Max-Q]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:81 memory:f6000000-f6ffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:f000(size=128) memory:f7000000-f707ffff
  *-display
       description: VGA compatible controller
       product: Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:05:00.0
       logical name: /dev/fb0
       version: c2
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi msix vga_controller bus_master cap_list fb
       configuration: depth=32 driver=amdgpu latency=0 resolution=1920,1080
       resources: irq:24 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:c000(size=256) memory:f7500000-f757ffff

But my output from stable diffusion gives me:

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Commit hash: 226d840e84c5f306350b0681945989b86760e616
Traceback (most recent call last):
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 360, in <module>
    prepare_environment()
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 272, in prepare_environment
    run_python("import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'")
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 129, in run_python
    return run(f'"{python}" -c "{code}"', desc, errdesc)
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/launch.py", line 105, in run
    raise RuntimeError(message)
RuntimeError: Error running command.
Command: "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/bin/python3" -c "import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'"
Error code: 1
stdout: <empty>
stderr: /home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at ../c10/hip/HIPFunctions.cpp:110.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

If I open up Python within the venv, CUDA reports that there is one device, but the moment I try to get more info or do anything it errors out:

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice
>>> torch.cuda.is_available()
/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at ../c10/hip/HIPFunctions.cpp:110.)
  return torch._C._cuda_getDeviceCount() > 0
False

I am very lost at this point. I have tried NVIDIA’s PPA drivers as well as the Ubuntu drivers and can’t seem to get any setup that makes it past this point. I appreciate any and all help you can offer. Thank you!

This error:

>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/sd/stable_diffusion_stuff/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice
>>> torch.cuda.is_available()

indicates that PyTorch fails while trying to initialize your AMD GPU (note the HIP references in the error messages), as it seems you are using an NVIDIA GeForce GTX 1650 alongside another AMD device.
I’m not familiar with your setup, but would it be possible to disable the AMD device while you are running Stable Diffusion on your NVIDIA GPU?
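One way to try that suggestion without touching the BIOS (a sketch, not something tested in this thread) is to hide devices from the GPU runtime via environment variables before launching the webui:

```shell
# Hypothetical workaround: restrict which GPUs the runtime can see.
# A ROCm build of PyTorch honors HIP_VISIBLE_DEVICES; a CUDA build
# honors CUDA_VISIBLE_DEVICES in the same way.
export HIP_VISIBLE_DEVICES=0    # expose only device 0 to the ROCm runtime
python launch.py
```

Which index the NVIDIA card gets (if it is visible to the runtime at all) depends on the installed build, so this may not help if the wrong PyTorch flavor is installed in the first place.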

@ptrblck Thank you for responding! I think you are correct, as when I run pip3 list I get:

torch              1.13.1+rocm5.2
torchvision        0.14.1+rocm5.2

So it looks like the Stable Diffusion script saw my AMD GPU and installed the ROCm build of PyTorch instead of the regular one? I’m going to try again: delete the venv and see if I can find a way to disable the AMD GPU before recreating the environment. Will update here soon.
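As an aside, the build flavor can be read straight off the wheel’s local version suffix shown by pip (e.g. `1.13.1+rocm5.2` vs. `1.13.1+cu117`). A tiny sketch of that check (the function name is made up for illustration):

```python
# Classify a PyTorch wheel by its local version tag, as printed by
# `pip list` or `torch.__version__`, e.g. "1.13.1+rocm5.2" or "1.13.1+cu117".
def torch_backend(version: str) -> str:
    """Return "rocm", "cuda", or "cpu" based on the local version suffix."""
    _, _, local = version.partition("+")   # text after "+", or "" if absent
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    return "cpu"

print(torch_backend("1.13.1+rocm5.2"))  # rocm
print(torch_backend("1.13.1+cu117"))    # cuda
```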

Yes, I think you are right and indeed the ROCm version was installed.
I don’t know how you are installing PyTorch (and other dependencies) in your environment, but maybe it’s possible to pre-install PyTorch with e.g. CUDA 11.7 via pip install torch as described in the install instructions.

Thank you!! I was unable to find where in the code it kept getting ROCm from, but I went into the venv, manually uninstalled torch and torchvision, and then reinstalled, and it works now! Thank you so much!
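For anyone hitting the same issue, the manual fix described above looks roughly like this (a sketch; the version pins assume the torch 1.13.x / CUDA 11.7 wheels that match the pip list output earlier in the thread):

```shell
# Run from the stable-diffusion-webui directory, inside its venv.
source venv/bin/activate
pip uninstall -y torch torchvision
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117

# Sanity check: the version should now show "+cu117" and CUDA should be usable.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```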
