Comfy_UI: Attempting to use hipBLASLt on an unsupported architecture!

Hello,

I’m trying to run Comfy_UI on my RX 7900 XT. I tried installing ROCm and the nightly build of PyTorch in a Fedora distrobox, as per the installation guide on the official PyTorch website, and I also tried the pre-built Ubuntu+ROCm+PyTorch Docker image from AMD’s site (PyTorch on ROCm — ROCm installation (Linux)). In both cases Comfy_UI starts up as normal, but as soon as I queue an operation it shuts down with the error “Attempting to use hipBLASLt on an unsupported architecture!”.
From what I’ve read, this issue on gfx1100 cards was supposed to be fixed in PyTorch 2.5.1, somewhere around October 2024. Just to see whether ROCm was causing the problem, I updated the AMD pre-built container from ROCm 6.2.4 to 6.3, but nothing changed. I don’t know what my next troubleshooting step should be, especially since the container AMD assembled and tested doesn’t work either.
Please help.

Host system:

Operating System: Debian GNU/Linux 12
KDE Plasma Version: 6.2.5
KDE Frameworks Version: 6.10.0
Qt Version: 6.7.2
Kernel Version: 6.12.9-amd64 (64-bit)
Graphics Platform: Wayland
Processors: 24 × AMD Ryzen 9 7900X 12-Core Processor
Memory: 61.9 GiB of RAM
Graphics Processor: AMD Radeon RX 7900 XT
Manufacturer: ASUS

Comfy_UI output on the AMD pre-built container:

Total VRAM 20464 MB, total RAM 63432 MB
pytorch version: 2.6.0.dev20241122+rocm6.2
Set vram state to: NORMAL_VRAM
Device: cuda:0 Radeon RX 7900 XT : native
Using sub quadratic optimization for attention, if you have memory or speed issues try using: --use-split-cross-attention
[Prompt Server] web root: /home/sersys/pyproj/ComfyUI-0.3.12/web

Import times for custom nodes:
   0.0 seconds: /home/sersys/pyproj/ComfyUI-0.3.12/custom_nodes/websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
/home/sersys/pyproj/ComfyUI-0.3.12/comfy/ops.py:64: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:296.)
  return torch.nn.functional.linear(input, weight, bias)
Requested to load SDXL
loaded completely 9.5367431640625e+25 4897.0483474731445 True
  0%|                                                                | 0/20 [00:00<?, ?it/s]:0:rocdevice.cpp            :2984: 93862815501 us: [pid:318562 tid:0x7f122e5ff640] Callback: Queue 0x7f0ed8000000 aborting with error : HSA_STATUS_ERROR_OUT_OF_REGISTERS: Kernel has requested more VGPRs than are available on this agent code: 0x2d
Aborted (core dumped)

The PyTorch version used is 2.6.0.dev20241122+rocm6.2.
Can you please try the latest PyTorch ROCm nightly? The issue should have been fixed there.

Hello,
Thank you for the suggestion. Apparently I forgot to include the output from my other setup, a Fedora distrobox, in my previous post; that is where I manually installed ROCm and PyTorch separately. I have indeed tried the latest version, but just as a sanity check I re-ran the instructions from the PyTorch installation guide page:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://download.pytorch.org/whl/nightly/rocm6.3
Requirement already satisfied: torch in ./.local/lib/python3.13/site-packages (2.7.0.dev20250110+rocm6.3)
Requirement already satisfied: torchvision in ./.local/lib/python3.13/site-packages (0.22.0.dev20250110+rocm6.3)
Requirement already satisfied: torchaudio in ./.local/lib/python3.13/site-packages (2.6.0.dev20250110+rocm6.3)
Requirement already satisfied: filelock in ./.local/lib/python3.13/site-packages (from torch) (3.16.1)
Requirement already satisfied: typing-extensions>=4.10.0 in ./.local/lib/python3.13/site-packages (from torch) (4.12.2)
Requirement already satisfied: setuptools in ./.local/lib/python3.13/site-packages (from torch) (72.1.0)
Requirement already satisfied: sympy==1.13.1 in ./.local/lib/python3.13/site-packages (from torch) (1.13.1)
Requirement already satisfied: networkx in ./.local/lib/python3.13/site-packages (from torch) (3.4.2)
Requirement already satisfied: jinja2 in ./.local/lib/python3.13/site-packages (from torch) (3.1.4)
Requirement already satisfied: fsspec in ./.local/lib/python3.13/site-packages (from torch) (2024.10.0)
Requirement already satisfied: pytorch-triton-rocm==3.2.0+git0d4682f0 in ./.local/lib/python3.13/site-packages (from torch) (3.2.0+git0d4682f0)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in ./.local/lib/python3.13/site-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: numpy in ./.local/lib/python3.13/site-packages (from torchvision) (2.1.2)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in ./.local/lib/python3.13/site-packages (from torchvision) (11.0.0)
Requirement already satisfied: MarkupSafe>=2.0 in ./.local/lib/python3.13/site-packages (from jinja2->torch) (2.1.5)

This is the output of Comfy_UI:

Checkpoint files will always be loaded safely.
Total VRAM 20464 MB, total RAM 63431 MB
pytorch version: 2.7.0.dev20250110+rocm6.3
Set vram state to: NORMAL_VRAM
Device: cuda:0 Radeon RX 7900 XT : native
Using sub quadratic optimization for attention, if you have memory or speed issues try using: --use-split-cross-attention
[Prompt Server] web root: /home/sersys/pyproj/ComfyUI-0.3.12/web

Import times for custom nodes:
   0.0 seconds: /home/sersys/pyproj/ComfyUI-0.3.12/custom_nodes/websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
/home/sersys/.local/lib/python3.13/site-packages/torch/functional.py:407: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:328.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
Requested to load SDXL
loaded completely 9.5367431640625e+25 4897.0483474731445 True
  0%|                                                                                                                           | 0/20 [00:00<?, ?it/s]:0:rocdevice.cpp            :3020: 134126665871d us:  Callback: Queue 0x7ff2c0100000 aborting with error : HSA_STATUS_ERROR_OUT_OF_REGISTERS: Kernel has requested more VGPRs than are available on this agent code: 0x2d
Aborted (core dumped)

I’ve also checked the AMD-provided pre-built Ubuntu container, which I have by now updated while trying to fix the issue:

$pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://download.pytorch.org/whl/nightly/rocm6.3
Requirement already satisfied: torch in ./.local/lib/python3.10/site-packages (2.6.0)
Requirement already satisfied: torchvision in ./.local/lib/python3.10/site-packages (0.21.0)
Requirement already satisfied: torchaudio in ./.local/lib/python3.10/site-packages (2.6.0)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in ./.local/lib/python3.10/site-packages (from torch) (12.4.127)
Requirement already satisfied: sympy==1.13.1 in ./.local/lib/python3.10/site-packages (from torch) (1.13.1)
Requirement already satisfied: typing-extensions>=4.10.0 in ./.local/lib/python3.10/site-packages (from torch) (4.12.2)
Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in ./.local/lib/python3.10/site-packages (from torch) (11.2.1.3)
Requirement already satisfied: networkx in ./.local/lib/python3.10/site-packages (from torch) (3.4.2)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in ./.local/lib/python3.10/site-packages (from torch) (2.21.5)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in ./.local/lib/python3.10/site-packages (from torch) (12.3.1.170)
Requirement already satisfied: fsspec in ./.local/lib/python3.10/site-packages (from torch) (2024.10.0)
Requirement already satisfied: jinja2 in ./.local/lib/python3.10/site-packages (from torch) (3.1.4)
Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in ./.local/lib/python3.10/site-packages (from torch) (12.4.5.8)
Requirement already satisfied: nvidia-cusparselt-cu12==0.6.2 in ./.local/lib/python3.10/site-packages (from torch) (0.6.2)
Requirement already satisfied: filelock in ./.local/lib/python3.10/site-packages (from torch) (3.16.1)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in ./.local/lib/python3.10/site-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in ./.local/lib/python3.10/site-packages (from torch) (10.3.5.147)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in ./.local/lib/python3.10/site-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in ./.local/lib/python3.10/site-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in ./.local/lib/python3.10/site-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in ./.local/lib/python3.10/site-packages (from torch) (9.1.0.70)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in ./.local/lib/python3.10/site-packages (from torch) (11.6.1.9)
Requirement already satisfied: triton==3.2.0 in ./.local/lib/python3.10/site-packages (from torch) (3.2.0)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in ./.local/lib/python3.10/site-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: numpy in ./.local/lib/python3.10/site-packages (from torchvision) (2.1.2)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in ./.local/lib/python3.10/site-packages (from torchvision) (11.0.0)
Requirement already satisfied: MarkupSafe>=2.0 in ./.local/lib/python3.10/site-packages (from jinja2->torch) (2.1.5)

Previously I got the same Comfy_UI output as above, but now it’s different; probably restarting the distrobox did something.
However, now it’s looking for NVIDIA drivers for some reason:

Checkpoint files will always be loaded safely.
Traceback (most recent call last):
  File "/home/sersys/pyproj/ComfyUI-0.3.12/main.py", line 136, in <module>
    import execution
  File "/home/sersys/pyproj/ComfyUI-0.3.12/execution.py", line 13, in <module>
    import nodes
  File "/home/sersys/pyproj/ComfyUI-0.3.12/nodes.py", line 22, in <module>
    import comfy.diffusers_load
  File "/home/sersys/pyproj/ComfyUI-0.3.12/comfy/diffusers_load.py", line 3, in <module>
    import comfy.sd
  File "/home/sersys/pyproj/ComfyUI-0.3.12/comfy/sd.py", line 6, in <module>
    from comfy import model_management
  File "/home/sersys/pyproj/ComfyUI-0.3.12/comfy/model_management.py", line 166, in <module>
    total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
  File "/home/sersys/pyproj/ComfyUI-0.3.12/comfy/model_management.py", line 129, in get_torch_device
    return torch.device(torch.cuda.current_device())
  File "/home/sersys/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 971, in current_device
    _lazy_init()
  File "/home/sersys/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

This seems to be a known bug, and ironically the only solution is to downgrade PyTorch, in which case I’m facing the hipBLASLt issue again.
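For what it’s worth, a quick way I found to check which backend a given torch wheel was built for is to inspect its version attributes: a ROCm build carries a +rocm suffix and a non-None torch.version.hip, while a CUDA build reports torch.version.cuda instead (a small diagnostic one-liner, run inside the container in question):

python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.version.cuda)"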

I can no longer install pytorch.

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://download.pytorch.org/whl/nightly/rocm6.3
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/rocm6.3/torch-2.7.0.dev20250206%2Brocm6.3-cp310-cp310-manylinux_2_28_x86_64.whl (4323.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 4.3/4.3 GB 35.6 MB/s eta 0:00:01
ERROR: Exception:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/base_command.py", line 165, in exc_logging_wrapper
    status = run_func(*args)
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/req_command.py", line 205, in wrapper
    return func(self, options, args)
  File "/usr/lib/python3/dist-packages/pip/_internal/commands/install.py", line 339, in run
    requirement_set = resolver.resolve(
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/resolver.py", line 94, in resolve
    result = self._result = resolver.resolve(
  File "/usr/lib/python3/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
    state = resolution.resolve(requirements, max_rounds=max_rounds)
  File "/usr/lib/python3/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 348, in resolve
    self._add_to_criteria(self.state.criteria, r, parent=None)
  File "/usr/lib/python3/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 172, in _add_to_criteria
    if not criterion.candidates:
  File "/usr/lib/python3/dist-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
    return bool(self._sequence)
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
    return any(self)
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
    return (c for c in iterator if id(c) not in self._incompatible_ids)
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
    candidate = func()
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/factory.py", line 215, in _make_candidate_from_link
    self._link_candidate_cache[link] = LinkCandidate(
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 288, in __init__
    super().__init__(
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 158, in __init__
    self.dist = self._prepare()
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 227, in _prepare
    dist = self._prepare_distribution()
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 299, in _prepare_distribution
    return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
  File "/usr/lib/python3/dist-packages/pip/_internal/operations/prepare.py", line 487, in prepare_linked_requirement
    return self._prepare_linked_requirement(req, parallel_builds)
  File "/usr/lib/python3/dist-packages/pip/_internal/operations/prepare.py", line 532, in _prepare_linked_requirement
    local_file = unpack_url(
  File "/usr/lib/python3/dist-packages/pip/_internal/operations/prepare.py", line 214, in unpack_url
    file = get_http_url(
  File "/usr/lib/python3/dist-packages/pip/_internal/operations/prepare.py", line 94, in get_http_url
    from_path, content_type = download(link, temp_dir.path)
  File "/usr/lib/python3/dist-packages/pip/_internal/network/download.py", line 146, in __call__
    for chunk in chunks:
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/progress_bars.py", line 304, in _rich_progress_bar
    for chunk in iterable:
  File "/usr/lib/python3/dist-packages/pip/_internal/network/utils.py", line 63, in response_chunks
    for chunk in response.raw.stream(
  File "/usr/lib/python3/dist-packages/pip/_vendor/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/lib/python3/dist-packages/pip/_vendor/urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib/python3/dist-packages/pip/_vendor/cachecontrol/filewrapper.py", line 96, in read
    self._close()
  File "/usr/lib/python3/dist-packages/pip/_vendor/cachecontrol/filewrapper.py", line 76, in _close
    self.__callback(result)
  File "/usr/lib/python3/dist-packages/pip/_vendor/cachecontrol/controller.py", line 331, in cache_response
    self.serializer.dumps(request, response, body),
  File "/usr/lib/python3/dist-packages/pip/_vendor/cachecontrol/serialize.py", line 70, in dumps
    return b",".join([b"cc=4", msgpack.dumps(data, use_bin_type=True)])
  File "/usr/lib/python3/dist-packages/pip/_vendor/msgpack/__init__.py", line 35, in packb
    return Packer(**kwargs).pack(o)
  File "/usr/lib/python3/dist-packages/pip/_vendor/msgpack/fallback.py", line 885, in pack
    self._pack(obj)
  File "/usr/lib/python3/dist-packages/pip/_vendor/msgpack/fallback.py", line 864, in _pack
    return self._pack_map_pairs(
  File "/usr/lib/python3/dist-packages/pip/_vendor/msgpack/fallback.py", line 970, in _pack_map_pairs
    self._pack(v, nest_limit - 1)
  File "/usr/lib/python3/dist-packages/pip/_vendor/msgpack/fallback.py", line 864, in _pack
    return self._pack_map_pairs(
  File "/usr/lib/python3/dist-packages/pip/_vendor/msgpack/fallback.py", line 970, in _pack_map_pairs
    self._pack(v, nest_limit - 1)
  File "/usr/lib/python3/dist-packages/pip/_vendor/msgpack/fallback.py", line 821, in _pack
    raise ValueError("Memoryview is too large")
ValueError: Memoryview is too large

I’ve finally managed to install the latest ROCm and PyTorch versions and got it running:

$ apt show rocm-libs -a
Package: rocm-libs
Version: 6.3.2.60302-66~24.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 2.3.0.60302-66~24.04), hipblaslt (= 0.10.0.60302-66~24.04), hipfft (= 1.0.17.60302-66~24.04), hipsolver (= 2.3.0.60302-66~24.04), hipsparse (= 3.1.2.60302-66~24.04), hiptensor (= 1.4.0.60302-66~24.04), miopen-hip (= 3.3.0.60302-66~24.04), half (= 1.12.0.60302-66~24.04), rccl (= 2.21.5.60302-66~24.04), rocalution (= 3.2.1.60302-66~24.04), rocblas (= 4.3.0.60302-66~24.04), rocfft (= 1.0.31.60302-66~24.04), rocrand (= 3.2.0.60302-66~24.04), hiprand (= 2.11.1.60302-66~24.04), rocsolver (= 3.27.0.60302-66~24.04), rocsparse (= 3.3.0.60302-66~24.04), rocm-core (= 6.3.2.60302-66~24.04), hipsparselt (= 0.2.2.60302-66~24.04), composablekernel-dev (= 1.1.0.60302-66~24.04), hipblas-dev (= 2.3.0.60302-66~24.04), hipblaslt-dev (= 0.10.0.60302-66~24.04), hipcub-dev (= 3.3.0.60302-66~24.04), hipfft-dev (= 1.0.17.60302-66~24.04), hipsolver-dev (= 2.3.0.60302-66~24.04), hipsparse-dev (= 3.1.2.60302-66~24.04), hiptensor-dev (= 1.4.0.60302-66~24.04), miopen-hip-dev (= 3.3.0.60302-66~24.04), rccl-dev (= 2.21.5.60302-66~24.04), rocalution-dev (= 3.2.1.60302-66~24.04), rocblas-dev (= 4.3.0.60302-66~24.04), rocfft-dev (= 1.0.31.60302-66~24.04), rocprim-dev (= 3.3.0.60302-66~24.04), rocrand-dev (= 3.2.0.60302-66~24.04), hiprand-dev (= 2.11.1.60302-66~24.04), rocsolver-dev (= 3.27.0.60302-66~24.04), rocsparse-dev (= 3.3.0.60302-66~24.04), rocthrust-dev (= 3.3.0.60302-66~24.04), rocwmma-dev (= 1.6.0.60302-66~24.04), hipsparselt-dev (= 0.2.2.60302-66~24.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1058 B
APT-Manual-Installed: yes
APT-Sources: http://repo.radeon.com/rocm/apt/6.3.2 noble/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack

and

pip show torch
Name: torch
Version: 2.7.0.dev20250206+rocm6.3
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /home/sersys/pyproj/rocmtorch1/lib/python3.12/site-packages
Requires: filelock, fsspec, jinja2, networkx, pytorch-triton-rocm, setuptools, sympy, typing-extensions
Required-by: kornia, spandrel, torchaudio, torchsde, torchvision

but I’m still getting the hipBLASLt warning, followed by the same crash:

Checkpoint files will always be loaded safely.
Total VRAM 20464 MB, total RAM 63431 MB
pytorch version: 2.7.0.dev20250206+rocm6.3
Set vram state to: NORMAL_VRAM
Device: cuda:0 Radeon RX 7900 XT : native
Using sub quadratic optimization for attention, if you have memory or speed issues try using: --use-split-cross-attention
ComfyUI version: 0.3.14
[Prompt Server] web root: /home/sersys/pyproj/ComfyUI-0.3.14/web

Import times for custom nodes:
   0.0 seconds: /home/sersys/pyproj/ComfyUI-0.3.14/custom_nodes/websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load SD1ClipModel
loaded completely 7118.8 235.84423828125 True
/home/sersys/pyproj/rocmtorch1/lib/python3.12/site-packages/torch/functional.py:408: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:328.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
Requested to load BaseModel
loaded completely 6595.45478515625 1639.406135559082 True
  0%|                                                                                                                | 0/20 [00:00<?, ?it/s]:0:rocdevice.cpp            :3020: 18822005164d us:  Callback: Queue 0x7fdb5c400000 aborting with error : HSA_STATUS_ERROR_OUT_OF_REGISTERS: Kernel has requested more VGPRs than are available on this agent code: 0x2d
Aborted (core dumped)

There was a similar signature in [ROCm] Fix ADDMM hipBLASLt regression by naromero77amd · Pull Request #138267 · pytorch/pytorch · GitHub, which got fixed. In that case it was a TORCH_CHECK, which can lead to a runtime error.
However, in the stack above the hipBLASLt message comes from a different file, pytorch/aten/src/ATen/Context.cpp at main · pytorch/pytorch · GitHub, and there it is a TORCH_WARN_ONCE, which should not trigger a segfault.
Can you check the core dumps and confirm the backtrace?
I see this error: “rocdevice.cpp :3020: 18822005164d us: Callback: Queue 0x7fdb5c400000 aborting with error : HSA_STATUS_ERROR_OUT_OF_REGISTERS: Kernel has requested more VGPRs than are available on this agent code: 0x2d”. This might not be a PyTorch issue.
Also, can you please provide the Comfy_UI command to reproduce the bug?

How do I “confirm the backtrace”?
I’ve managed to get core dumps to generate, but I don’t really know what to do with them. I tried to use gdb, but I’m really lost, especially since the Python environment and the ROCm installation are inside a distrobox and I get multiple messages that files and directories either were not found or could not be opened. Also, the bt command seems to backtrace python3 instead of main.py. I’m not even sure gdb is the right tool to use.
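For reference, this is roughly how I understand it is usually done (a sketch with placeholder paths; running gdb inside the same distrobox should let it resolve the same libraries the process used):

gdb $(which python3) /path/to/core
(gdb) bt                      # backtrace of the crashing thread
(gdb) thread apply all bt     # backtraces of all threads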

How do I get the commands out of Comfy_UI?
I don’t know how to use the CLI version of Comfy_UI, and I couldn’t find any guide on how to print the command that was used by the GUI. If it helps, I’m using the example workflow that comes with Comfy_UI, with some checkpoints I’ve downloaded. I’ve tried using different checkpoints to see if that’s the problem, but the result is always the same error.

I’ve managed to run gdb without errors, apart from that failed download, and I think it’s giving me the right information now. I don’t know if any of this is useful, but it gave me this:

[Thread debugging using libthread_db enabled]                                                                                               
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python3 /home/sersys/pyproj/ComfyUI-0.3.14/main.py'.
Program terminated with signal SIGABRT, Aborted.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44

warning: 44     ./nptl/pthread_kill.c: No such file or directory
[Current thread is 1 (Thread 0x7f7d543ff6c0 (LWP 556574))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f813841427e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f81383f78ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f80eacd698c in amd::roc::callbackQueue(hsa_status_t, hsa_queue_s*, void*) ()
   from /home/sersys/pyproj/rocmtorch1/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#6  0x00007f8079e4e7b7 in bool rocr::AMD::AqlQueue::DynamicQueueEventsHandler<true>(long, void*) ()
   from /home/sersys/pyproj/rocmtorch1/lib/python3.12/site-packages/torch/lib/libhsa-runtime64.so
#7  0x00007f8079e793b1 in rocr::core::Runtime::AsyncEventsLoop(void*) ()
   from /home/sersys/pyproj/rocmtorch1/lib/python3.12/site-packages/torch/lib/libhsa-runtime64.so
#8  0x00007f8079e2ae77 in rocr::os::ThreadTrampoline(void*) ()
   from /home/sersys/pyproj/rocmtorch1/lib/python3.12/site-packages/torch/lib/libhsa-runtime64.so
#9  0x00007f813846baa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#10 0x00007f81384f8c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Hey @sersys ,
I have a couple thoughts here.

There appear to be two issues here:

  1. PyTorch complains that the hipBLASLt backend is not supported for your GPU
  2. Execution fails with the claim that too many vector general-purpose registers (VGPRs) are being requested

For 1), it appears non-fatal, but I can’t seem to find anything in the ROCm docs that suggests the Radeon RX 7900 XT GPU is not supported by hipBLASLt. I’m circling back with AMD’s hipBLASLt and PyTorch teams about this warning to see if we can get more information.

For 2), my understanding is that PyTorch will compile some kernels on the fly via MIOpen for the architecture detected at runtime. That dynamic compilation may assume a maximum number of threads per workgroup, which influences the VGPR-per-thread allocation. Comfy_UI or PyTorch may be launching the kernel with more threads per workgroup than are allowed, causing this error. What do you have the batch size set to in Comfy_UI? Another thing you might try is fixing all of the precision to float16, rather than the mixed precision you currently have going on (see the sketch below). I’m not an expert in Comfy_UI, but perhaps they have this as a setting somewhere?
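For example, if your Comfy_UI build exposes them (check python3 main.py --help), flags along these lines should pin everything to float16; the exact flag names are my assumption, not something I’ve verified against your version:

python3 main.py --force-fp16 --fp16-vae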

Interestingly, the Debian group seems to have caught a similar “out of registers” bug in their CI on the same GPU architecture, but in a lapack/float_complex unit test (Bug#1078724: W7800 (Navi 31; gfx1100): incorrect VGPR count).

If you’re able to share your Comfy_UI workflow file, I’m happy to try to reproduce the issue on matching hardware.

Hey @sersys ,
I’ve opened a ticket on your behalf at [Issue]: Comfy_UI hipblasLT not supported for Radeon 7900XT and HSA_STATUS_ERROR_OUT_OF_REGISTERS error · Issue #4437 · ROCm/ROCm · GitHub

With regards to the unsupported-GPU-in-hipBLASLt issue, AMD states that it has been resolved in the PyTorch nightly builds with ROCm 6.3.

However, I’ve pressed them on the report here, which shows you using a nightly build from February 6, 2025 that still produces the “hipBLASLt on an unsupported architecture” warning. Note that this just means BLAS calls will fall back to hipBLAS; it is not fatal.

This is likely unrelated to the crash with the out-of-registers error. Can you re-run the same example, in the same environment, but with the following environment variables set:

export AMD_LOG_LEVEL=3
export HSAKMT_DEBUG_LEVEL=4

When you re-run, please post the output here, or in the referenced github issue above. Would love to help you get some resolution on this one.
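For example, something along these lines should capture everything to a file you can attach (a sketch, assuming Comfy_UI is launched via main.py as in your logs):

AMD_LOG_LEVEL=3 HSAKMT_DEBUG_LEVEL=4 python3 main.py 2>&1 | tee amd_debug.log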

Hey @sersys ,
The last thing you could do to help debug the issue is to run this minimal script:

#!/usr/bin/env python
import torch

print(torch.version.hip)
print(torch.cuda.get_device_properties())

We’re looking to verify that the ROCm version in your environment is indeed >= 6.3. Additionally, we’re also looking to confirm that gcnArchName == 'gfx1100'.
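If it’s easier, the architecture name alone can also be printed with a one-liner (same idea as the script above; the explicit device index 0 is an assumption that the discrete GPU is the first device):

python3 -c "import torch; print(torch.cuda.get_device_properties(0).gcnArchName)"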

I’ve been unable to reproduce the hipBLASLt warning on a number of Radeon cards (including the RX 7900 XT), even with the same version of PyTorch you used (the nightly build from February 6, 2025).

I’m curious to see what you find, but I’m starting to suspect issues with Comfy_UI picking up the correct Python environment.

Hello,

Thank you very much for your suggestion, and even more for looking into this in such detail, going as far as opening a ticket. I truly appreciate it.

I have not set any batch size; it should be the default, whatever that value is. I couldn’t find any batch-size-related options that would be relevant to threads, and every batch-size-related Google result I could find is about batches of pictures being generated or loaded.

While I am using Debian, I run PyTorch in the AMD pre-built Ubuntu container where ROCm is preinstalled (PyTorch on ROCm — ROCm installation (Linux)).

I’m just using the default workflow that comes with Comfy_UI. According to Google, the workflow files are either .war or .yaml files, but I couldn’t find any of them. Here are the .json files, though as far as I can tell these contain only GUI-related info.

https://www.dropbox.com/scl/fi/g61jr192gkfiseayi9g62/Default-Workflow.json?rlkey=i97rem0eicjjxgczyfh2h4fy2&st=tnzrgdh9&dl=0

dropbox .com/scl/fi/25k9booej0r04koh5nshv/Default-Workflow-api.json?rlkey=k9fiqqbi1d90rn1i7h6uhuuaq&st=gknmpztx&dl=0
(sorry for the incomplete links; new users can only include two links per post)

Here is the output when I run comfy_ui with

export AMD_LOG_LEVEL=3
export HSAKMT_DEBUG_LEVEL=4

dropbox .com/scl/fi/7bdq6lpvi9vgmp2r1qq6x/amddebugpytorch.txt?rlkey=lya2h4ksut2grcp7lw8a9p5ub&st=yekuf7ft&dl=0

Here is the output of

print(torch.version.hip)
print(torch.cuda.get_device_properties())
6.3.42131-fa1d09cbd
_CudaDeviceProperties(name='Radeon RX 7900 XT', major=11, minor=0, gcnArchName='gfx1100', total_memory=20464MB, multi_processor_count=42, uuid=30303033-6430-3863-3030-303030303030, L2_cache_size=6MB)

And the longer version of the same output (probably longer due to the AMD_LOG_LEVEL=3 and HSAKMT_DEBUG_LEVEL=4 variables):
dropbox .com/scl/fi/a2k5na579f2vg6lf3pig5/pyver.txt?rlkey=qhwbauuk70ltpkaxil1wiyvne&st=lelsct8x&dl=0

Once again thank you for helping me.

On your Debian host system, what is your kernel version? Specifically, I’m looking for the output of

uname -r

from your host system that you are running docker run from

uname -r

6.12.12-amd64

When running with Docker, my understanding is that the docker container uses the Linux kernel from the host system.

From the compatibility matrix shown at System requirements (Linux) — ROCm installation (Linux), you are one minor version ahead of the most recent supported Linux kernel. This can possibly cause some problems.

I’ll get your logs posted to the GitHub issue to get some feedback from AMD as well. On our side, we’ll try running Comfy_UI on a gfx1100 system we have with a few different Linux kernels, including 6.12, to see if this is indeed related to the problem.

Hey @sersys,
From the logs you’ve shared, it looks like it’s picking up two GPUs: one is gfx1100 and the other is a gfx1036 GPU.

:3:rocdevice.cpp            :1801: 133669785946d us:  Gfx Major/Minor/Stepping: 11/0/0
:3:rocdevice.cpp            :1803: 133669785949d us:  HMM support: 0, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp            :1805: 133669785950d us:  Max SDMA Read Mask: 0x3, Max SDMA Write Mask: 0x3
:3:rocdevice.cpp            :235 : 133669786607d us:  Numa selects cpu agent[0]=0x41b424a0(fine=0x3dcb4590,coarse=0x448ef230) for gpu agent=0x44f26d20 CPU<->GPU XGMI=0
:3:rocsettings.cpp          :287 : 133669786610d us:  Using dev kernel arg wa = 0
:3:rocdevice.cpp            :1801: 133669786959d us:  Gfx Major/Minor/Stepping: 10/3/6
:3:rocdevice.cpp            :1803: 133669786962d us:  HMM support: 0, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp            :1805: 133669786963d us:  Max SDMA Read Mask: 0x1, Max SDMA Write Mask: 0x1
:3:hip_context.cpp          :49  : 133669787646d us:  Direct Dispatch: 1

The gfx1036 GPU may well be related to the unsupported-GPU-in-hipBLASLt warning. If kernels are being dispatched to that GPU, this could also be related to your core dump issue.

Can you share the output of rocminfo?
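If the full output is long, something like this should show just the agent names (a sketch; the exact field labels can vary between rocminfo versions):

rocminfo | grep -E 'Name|gfx'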

Edit:
Digging around, it looks like the gfx1036 is an iGPU. You may need to disable the iGPU on your system: Prerequisites to use ROCm on Radeon desktop GPUs for machine learning development — Use ROCm on Radeon GPUs
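If getting into the BIOS is inconvenient, an alternative that is sometimes enough is to hide the iGPU from the ROCm runtime with environment variables (a sketch; the assumption that the discrete card enumerates as device 0 should be checked against rocminfo first):

export HIP_VISIBLE_DEVICES=0     # restrict HIP to the first enumerated GPU (assumed to be the RX 7900 XT)
# or, equivalently, at the ROCr runtime level:
export ROCR_VISIBLE_DEVICES=0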

Right now I can’t restart my system to disable the iGPU, but I’ll get back to it once I manage to get around to it. In the meantime, this is my rocminfo:
https://www.dropbox.com/scl/fi/6fiffu761rjqa3lp0wp5m/rocminfo.txt?rlkey=twc0psrw87y6ws9xhvgzu4hfa&st=b8fkfzkn&dl=0

I’ve disabled the iGPU but I still get the same result.
https://www.dropbox.com/scl/fi/a1nezujl7uk88fr9utqbw/igpudisabled.txt?rlkey=fjkus9br27n1d64wvk1kb7q2n&st=1eovl9dw&dl=0

Also, about the kernel version: this is because I’m using the Debian testing repos. However, I’ve tried running Comfy_UI when the kernel was still on an earlier version, and I got the same result back then too.

With the iGPU disabled, it now looks like it’s only picking up the gfx1100 and the warning about hipBLASLt being used on an unsupported architecture is gone.

One other thing that might be going on here: on another issue related to Comfy_UI and PyTorch with ROCm, we discovered that Comfy_UI requires Python 3.12. ROCm’s PyTorch wheels are built against Python 3.10, which may cause some problems.

To work around this, you can try uninstalling your current torch, torchvision, and torchaudio packages and installing PyTorch from source (GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration) using Python 3.12, which is what appears to be used in your environment.
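Roughly, the ROCm source-build flow looks like the following (a sketch based on my reading of the PyTorch README; please treat the exact steps and the PYTORCH_ROCM_ARCH value as assumptions and double-check them against the README):

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
pip3 install -r requirements.txt
python3 tools/amd_build/build_amd.py                 # hipify the CUDA sources for ROCm
PYTORCH_ROCM_ARCH=gfx1100 python3 setup.py develop   # build for the gfx1100 architecture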

Edit: AMD has noted that the notice about Python 3.10 is outdated. Also, I just recalled that you’re not on WSL2… too many similar issues we’re tracking.