Which GPUs are supported by Torch 2?

Hello everyone. I am producing Stable Diffusion tutorial videos on my channel: SECourses - Software Engineering Courses - YouTube

Recently I showed how to install and use Torch 2: How To Install New DREAMBOOTH & Torch 2 On Automatic1111 Web UI PC For Epic Performance Gains Guide - YouTube

It works on cards such as the RTX 3060, but several people commented saying it didn’t work on a GTX 1080 Ti.

So where can I find the list of supported GPUs?

Thank you

The current binaries support all architectures from compute capability 3.7 to 9.0.
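
For anyone trying to map that range to a specific card, a minimal check along these lines (a sketch using only public torch.cuda APIs; the small covered() helper is just for illustration) should show whether the installed wheel ships kernels usable on the device:

import torch

# compute capability of the local GPU, e.g. (6, 1) for a GTX 1080 Ti or 1060
major, minor = torch.cuda.get_device_capability(0)

# architectures the installed wheel was compiled for, e.g. ['sm_37', ..., 'sm_86']
arch_list = torch.cuda.get_arch_list()

# cubins are forward-compatible within a major version, so a wheel built for
# sm_60 also covers sm_61 (Pascal consumer cards)
def covered(major, minor, archs):
    return any(int(a.split("_")[-1][0]) == major and int(a.split("_")[-1][1:]) <= minor
               for a in archs)

print(f"device: sm_{major}{minor}, wheel archs: {arch_list}")
print("kernels usable on this GPU:", covered(major, minor, arch_list))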

Could you describe the errors seen on the 1080 Ti?

CUDA error: no kernel image is available.

RuntimeError: CUDA error: no kernel image is available for execution on the device

Could you post the output of python -m torch.utils.collect_env, please?

I can’t because these are from comments that my viewers made. I don’t have that card; mine is an RTX 3060 and it is working.

In that case your viewers would need to post this information here, as the current 2.0.0 release works fine on Pascal GPUs:

>>> import torch
>>> torch.__version__
'2.0.0+cu117'
>>> torch.cuda.get_arch_list()
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
>>> x = torch.randn(1).cuda()
>>> x
tensor([-0.7546], device='cuda:0')

>>> import torch
>>> torch.__version__
'2.0.0+cu118'
>>> torch.cuda.get_arch_list()
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
>>> x = torch.randn(1).cuda()
>>> x
tensor([-0.3417], device='cuda:0')

It didn’t work for me.
My GPU is a 1080 Ti.

Thanks for posting. @ptrblck, can you check it?

I switched down from 12.0 in hopes that was the issue, but here you go:

(venv) D:\stable-diffusion-webui>python -m torch.utils.collect_env
Collecting environment information…
PyTorch version: 2.0.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: 3.9.1 (branches/release_39)
CMake version: version 3.25.0-rc2
Libc version: N/A

Python version: 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1060 6GB
Nvidia driver version: 527.41
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=3825
DeviceID=CPU0
Family=107
L2CacheSize=3072
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3825
Name=AMD Ryzen 5 1600 Six-Core Processor
ProcessorType=3
Revision=257

Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] open-clip-torch==2.7.0
[pip3] pytorch-lightning==1.7.6
[pip3] torch==2.0.0+cu118
[pip3] torchdiffeq==0.2.3
[pip3] torchmetrics==0.11.4
[pip3] torchsde==0.2.5
[pip3] torchvision==0.15.0+cu118
[conda] No relevant packages

I received the same errors on my 1060 as sebby621’s 1080 Ti, and I did post my report above.

Same here, 1080 Ti owner and Torch 2 not working:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

0%| | 0/20 [00:00<?, ?it/s]
Error completing request
Arguments: ('task(gkmg3awt4ativg6)', 'Ripped viking fighting a dragon, photorealistic, ((detailed face)), amazing natural skin tone, 4k textures, soft cinematic light, photoshop, epic scene, art by artgerm and greg rutkowski', '(anime:1.2), (manga:1.2), pigtail, paint, cartoon, render, (areola nipples:1.1), 3d, asian, deformities, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 704, 512, True, 0.7, 2, 'None', 0, 0, 0, [], 0, <scripts.external_code.ControlNetUnit object at 0x0000029294E50640>, <scripts.external_code.ControlNetUnit object at 0x0000029294E50100>, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0, None, None, 50) {}
Traceback (most recent call last):
File "C:\Users\ZeroCool22\Desktop\Auto\modules\call_queue.py", line 56, in f
res = list(func(*args, **kwargs))
File "C:\Users\ZeroCool22\Desktop\Auto\modules\call_queue.py", line 37, in f
res = func(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\modules\txt2img.py", line 56, in txt2img
processed = process_images(p)
File "C:\Users\ZeroCool22\Desktop\Auto\modules\processing.py", line 486, in process_images
res = process_images_inner(p)
File "C:\Users\ZeroCool22\Desktop\Auto\modules\processing.py", line 636, in process_images_inner
samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
File "C:\Users\ZeroCool22\Desktop\Auto\modules\processing.py", line 836, in sample
samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_samplers_kdiffusion.py", line 351, in sample
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_samplers_kdiffusion.py", line 227, in launch_sampling
return func()
File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_samplers_kdiffusion.py", line 351, in
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_samplers_kdiffusion.py", line 119, in forward
x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_hijack_utils.py", line 17, in
setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_hijack_utils.py", line 28, in call
return self.__orig_func(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1329, in forward
out = self.diffusion_model(x, t, context=cc)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 776, in forward
h = module(h, emb, context)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 84, in forward
x = layer(x, context)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 324, in forward
x = block(x, context=context[i])
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 259, in forward
return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\util.py", line 114, in checkpoint
return CheckpointFunction.apply(func, len(inputs), *args)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\autograd\function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\util.py", line 129, in forward
output_tensors = ctx.run_function(*ctx.input_tensors)
File "C:\Users\ZeroCool22\Desktop\Auto\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 262, in _forward
x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_hijack_optimizations.py", line 342, in xformers_attention_forward
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=get_xformers_flash_attention_op(q, k, v))
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 196, in memory_efficient_attention
return memory_efficient_attention(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 292, in _memory_efficient_attention
return memory_efficient_attention_forward(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 312, in memory_efficient_attention_forward
out, * = op.apply(inp, needs_gradient=False)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha\cutlass.py", line 175, in apply
out, lse, rng_seed, rng_offset = cls.OPERATOR(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch_ops.py", line 502, in call
return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Still no reproduction from my side on Windows using 2.0.0+cu117 and 2.0.0+cu118 on two additional and different sm_61 devices.

However, I also assume none of your issues is related to my posted code snippet and that nobody used it for any verification?

Based on the stacktrace you’ve posted, I would guess xformers does not support your GPU architecture:

File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_hijack_optimizations.py", line 342, in xformers_attention_forward
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=get_xformers_flash_attention_op(q, k, v))
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 196, in memory_efficient_attention
return memory_efficient_attention(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 292, in _memory_efficient_attention
return memory_efficient_attention_forward(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 312, in memory_efficient_attention_forward
out, * = op.apply(inp, needs_gradient=False)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha\cutlass.py", line 175, in apply
out, lse, rng_seed, rng_offset = cls.OPERATOR(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\torch_ops.py", line 502, in call
return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: no kernel image is available for execution on the device

Could you, @GeneralAwareness, and @FurkanGozukara explain how xformers was built, which build flags were used, and how its GPU architecture support is defined, please?

EDIT: using conda install -c xformers xformers as well as pip install xformers will downgrade PyTorch to 1.13.1 in my setup, so it’s still unclear how you’ve built it.
Building it from source seems to work and I’m able to execute a benchmark via:

python xformers/benchmarks/benchmark_mem_eff_attention.py

which eventually fails with:

...
====== {'shape': (64, 1024, 16, 128), 'num_threads': 1, 'dropout_p': 0.3, 'attn_bias_cfg': (<class 'xformers.ops.fmha.attn_bias.LowerTriangularMask'>, False), 'dtype': torch.float16} ======                      
====== {'shape': (64, 1024, 16, 128), 'num_threads': 1, 'dropout_p': 0.3, 'attn_bias_cfg': (<class 'xformers.ops.fmha.attn_bias.LowerTriangularMask'>, False), 'dtype': torch.bfloat16} ======                     
====== {'shape': (64, 1024, 16, 128), 'num_threads': 1, 'dropout_p': 0.3, 'attn_bias_cfg': (<class 'xformers.ops.fmha.attn_bias.LowerTriangularMask'>, False), 'dtype': torch.float32} ======                      
                                                                                                                                                                                                                   
Traceback (most recent call last):
  File "xformers/benchmarks/benchmark_mem_eff_attention.py", line 356, in <module>
    benchmark_main_helper(benchmark_forward, CASES, min_run_time=min_run_time)
  File "/workspace/xformers/xformers/benchmarks/utils.py", line 478, in benchmark_main_helper
    _render_bar_plot(results_for_print, store_results_folder)
  File "/workspace/xformers/xformers/benchmarks/utils.py", line 294, in _render_bar_plot
    if all_descriptions[0] == "":
IndexError: list index out of range

which seems to be a script issue unrelated to the GPU architecture being used.
Based on this open issue there is also no PyTorch 2.0 support yet, which matches my observation about conda and pip trying to downgrade my PyTorch version.
To help further, I would need more information than “Torch 2 not working”, and it would be great if anyone could run a simple smoke test using pure PyTorch to verify that PyTorch itself works correctly on your Pascal GPU.
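
For example, something along these lines (only plain PyTorch calls, no xformers or webui code) would already tell us whether the shipped binaries run on the card:

import torch

print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

x = torch.randn(64, 64, device="cuda")
y = torch.randn(64, 64, device="cuda")
print((x @ y).sum())                      # fp32 matmul on the GPU

conv = torch.nn.Conv2d(3, 8, 3).cuda()    # also exercises the cuDNN path
print(conv(torch.randn(1, 3, 32, 32, device="cuda")).shape)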

No idea about any of that, but this is what we were instructed to do to get this working.

./venv/Scripts/activate
pip install https://download.pytorch.org/whl/cu118/torch-2.0.0%2Bcu118-cp310-cp310-win_amd64.whl https://download.pytorch.org/whl/cu118/torchvision-0.15.0%2Bcu118-cp310-cp310-win_amd64.whl 
pip install --no-deps --force-reinstall  https://github.com/ArrowM/xformers/releases/download/xformers-0.0.17+b6be33a.d20230315-cp310-cu118/xformers-0.0.17+b6be33a.d20230315-cp310-cp310-win_amd64.whl
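
If the installed xformers version ships its diagnostics module, running it should report which memory-efficient attention ops are available on the local GPU (I’m not sure whether the custom wheel above includes it):

python -m xformers.info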

Could you at least verify that python -c "import torch; print(torch.__version__); print(torch.randn(1).cuda())" works?
Just copy/paste the code into your terminal, execute it, and post the output here.

D:\stable-diffusion-webui>"./venv/scripts/activate"

(venv) D:\stable-diffusion-webui>python -c "import torch; print(torch.__version__); print(torch.randn(1).cuda())"
2.0.0+cu118
tensor([-0.5276], device='cuda:0')

(venv) D:\stable-diffusion-webui>

Thank you! This confirms that PyTorch is able to execute code on your GPU as already mentioned.

Where is the problem then? That craptastic Xformers that has always given me trouble if you attempt to leave its sandbox?

The stacktrace points to the memory_efficient_attention from xformers:

File "C:\Users\ZeroCool22\Desktop\Auto\modules\sd_hijack_optimizations.py", line 342, in xformers_attention_forward
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=get_xformers_flash_attention_op(q, k, v))
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 196, in memory_efficient_attention
return memory_efficient_attention(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 292, in _memory_efficient_attention
return memory_efficient_attention_forward(
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha_init.py", line 312, in memory_efficient_attention_forward
out, * = op.apply(inp, needs_gradient=False)
File "C:\Users\ZeroCool22\Desktop\Auto\venv\lib\site-packages\xformers\ops\fmha\cutlass.py", line 175, in apply
out, lse, rng_seed, rng_offset = cls.OPERATOR(

and the wheels you are installing were created by the user ArrowM, as seen in the URL you’ve posted, so I don’t know how these were built.
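
An isolated reproduction of just that call, outside the webui, might look like the sketch below (shapes are arbitrary; this assumes the public xformers.ops API of the 0.0.17 builds):

import torch
import xformers.ops

# arbitrary shapes: (batch, seq_len, num_heads, head_dim)
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

# should raise the same "no kernel image" error if the wheel lacks kernels for this GPU
out = xformers.ops.memory_efficient_attention(q, k, v)
print(out.shape)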

Do you have another source for those, since the video and the dreambooth extension are using them? LOTS of people are going to use those.

edit: I just wanted to add that I tested it without xformers, only to be hit with no fp16 support, so I have to use --no-half. I used that, and I knew what was coming next: sure enough, OOM. I can do 512x512 though, so that eliminates PyTorch as the culprit.
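
(To separate the webui’s --no-half handling from PyTorch itself, a minimal fp16 check in plain PyTorch could look like the following; fp16 math is slow on Pascal, but it should still execute:)

import torch

a = torch.randn(256, 256, device="cuda", dtype=torch.float16)
b = torch.randn(256, 256, device="cuda", dtype=torch.float16)
out = a @ b  # fp16 matmul; low throughput on Pascal, but it should run
print(out.dtype, torch.isfinite(out).all().item())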

If I build xformers (I have done it before), then it should work, correct? I have had the “opportunity” to communicate with their devs, and I would rather pull my toes off than go through that again.

Btw, this is the best support I have seen in a long time. Thank you.

edit #2: I built xformers on my machine and it worked. I had to rebuild it so many times with 1.5, and this time it worked without an issue. On SD 1.5, 512x512, euler_a, 20 steps: 22-25s is now down to 12s on my 1060. Without xformers (and having to use FP32) the time was 14s.

Follow this guide, as it has always been foolproof for me, but YMMV: https://www.reddit.com/r/StableDiffusion/comments/xz26lq/automatic1111_xformers_cross_attention_with_on/
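
For reference, a rough sketch of such a source build targeting Pascal (assuming the build honors the TORCH_CUDA_ARCH_LIST environment variable, as PyTorch's extension builder normally does; run inside the activated venv, paths and versions are illustrative):

git clone --recursive https://github.com/facebookresearch/xformers.git
cd xformers
set TORCH_CUDA_ARCH_LIST=6.1
pip install -v -e .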

2.0.0+cu118
tensor([0.2080], device='cuda:0')