Request: Add CUDA sm_120 (Blackwell) support for ConvNeXtV2 / fused kernels

Hi all,

I’m running into a reproducible CUDA kernel failure on an RTX 5090 (sm_120) when using models that rely on ConvNeXtV2 fused kernels. This appears to be due to missing sm_120 support in the current PyTorch builds.

Environment

  • GPU: NVIDIA RTX 5090 (Blackwell, sm_120)

  • Driver: NVIDIA Studio Driver 596.36

  • CUDA Toolkit: 12.4

  • CUDA Version: 13.2

  • PyTorch: 2.6.0

  • OS: Windows 11

Error

During inference, any model that uses ConvNeXtV2 or similar fused ops fails with:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This happens consistently inside the fused ConvNeXtV2 convolution layers:

File ".../convnextv2.py", line 120, in forward
    x = self.forward_features(x)
...
File ".../conv.py", line 549, in _conv_forward
    return F.conv2d(...)

Summary of the issue

  • PyTorch wheels currently support up to sm_90 (Hopper).

  • RTX 50xx GPUs require sm_120 kernels.

  • Fused kernels (ConvNeXtV2, some custom ops) cannot fall back to PTX JIT.

  • As a result, models that rely on these ops fail immediately on Blackwell GPUs.
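The failure mode summarized above follows from CUDA's general compatibility rules, which can be sketched in a few lines. This is a simplified model, not PyTorch code: the function name `can_run` and its arguments are hypothetical, and it ignores architecture-specific targets such as `sm_90a`. It assumes that precompiled binary code (SASS) runs only on a device with the same major compute capability and an equal or higher minor version, while embedded PTX can be JIT-compiled for any device of equal or higher capability.

```python
def can_run(device_cc, sass_archs, ptx_archs=()):
    """Return True if a kernel image is usable on the device (simplified model).

    device_cc  -- (major, minor) of the GPU, e.g. (12, 0) for sm_120
    sass_archs -- capabilities the binary has precompiled code for
    ptx_archs  -- capabilities the binary embeds PTX for (may be empty)
    """
    major, minor = device_cc
    # Precompiled code is only compatible within the same major version.
    for s_major, s_minor in sass_archs:
        if s_major == major and s_minor <= minor:
            return True
    # PTX can be JIT-compiled for any device of equal or higher capability.
    for p in ptx_archs:
        if tuple(p) <= (major, minor):
            return True
    return False


# A build with SASS up to sm_90 and no PTX has nothing to run on sm_120.
print(can_run((12, 0), [(8, 0), (8, 6), (9, 0)]))              # False
# The same build with compute_90 PTX embedded could fall back to JIT.
print(can_run((12, 0), [(8, 0), (9, 0)], ptx_archs=[(9, 0)]))  # True
```

Under this model, "no kernel image is available" is what you would expect whenever neither loop finds a match for the device.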

Request

Could the PyTorch team provide guidance or a timeline for:

  1. Official sm_120 support in upcoming PyTorch wheels

  2. Rebuilt fused kernels (ConvNeXtV2 and similar ops) targeting sm_120

  3. Any nightly builds or experimental wheels that include Blackwell support

  4. Whether CUDA 12.8 or later will be required for full compatibility

There are already several users reporting similar issues with RTX 50xx cards, so I wanted to provide a clean repro case and error trace.

Happy to provide additional logs, environment details, or test builds if needed.

Thanks!

Please contact the model authors who developed these custom kernels to update them. PyTorch itself has supported all Blackwell GPUs since the 2.7 release.

PyTorch does NOT support sm_120, the compute capability of the RTX 5080/5090 and other Blackwell GPUs, which makes these cards impossible to use. AI tools (such as PyTorch, vLLM, or FlashInfer) require explicit sm_120 support in their build files or JIT-compilation routines to function without workarounds. I’m not sure which model authors you are referring to?

PyTorch has supported sm_120 since the 2.7.0 release when built with CUDA 12.8, as shown by:

torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120']

If you have any issues with custom kernels, please contact the authors of the repository directly.
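To compare the arch list against the GPU that is actually installed, something like the following sketch may help. The helper names `arch_tag` and `has_native_kernels` are made up for illustration; only `torch.cuda.is_available()`, `torch.cuda.get_device_capability()`, and `torch.cuda.get_arch_list()` are PyTorch calls, and the import is guarded so the helpers can be used without PyTorch installed.

```python
def arch_tag(capability):
    """Format a (major, minor) compute capability as an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def has_native_kernels(arch_list, capability):
    """Return True if the arch list contains the device's own sm_XY tag."""
    return arch_tag(capability) in arch_list


try:
    import torch
except ImportError:  # helpers above still work without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)  # (12, 0) on an sm_120 card
    print(arch_tag(cap), has_native_kernels(torch.cuda.get_arch_list(), cap))
```

Note that this only reflects the architectures PyTorch itself was built for. Separately compiled custom extensions carry their own set of targets.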

Okay, I’ll try it again. Claude Code couldn’t get it working, let’s see if I can…

Once you have installed the latest build, verify it by printing torch.__version__, torch.version.cuda, and torch.cuda.get_arch_list(). This should show e.g. 2.12.0 and the CUDA runtime version you selected (pick either 13.0 or 13.2; do not use 12.6, as you need 12.8+).
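A small script along these lines would print the three values and flag a CUDA runtime older than 12.8. The helper `cuda_version_ok` is a hypothetical name introduced here, and it assumes `torch.version.cuda` is either `None` (CPU-only build) or a "major.minor" string.

```python
def cuda_version_ok(cuda_version, minimum=(12, 8)):
    """Return True if a 'major.minor' CUDA version string meets the minimum."""
    if not cuda_version:  # None on CPU-only builds
        return False
    major, minor = (int(p) for p in cuda_version.split(".")[:2])
    return (major, minor) >= minimum


try:
    import torch
except ImportError:  # lets the helper be used without PyTorch installed
    torch = None

if torch is not None:
    print("torch:", torch.__version__)
    print("CUDA runtime:", torch.version.cuda)
    print("arch list:", torch.cuda.get_arch_list())
    print("CUDA >= 12.8:", cuda_version_ok(torch.version.cuda))
```

If the last line prints False, the installed wheel was most likely built against an older CUDA runtime and should be replaced.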

A quick check using the latest 2.12.0+cu132 also shows that sm_100 and sm_120 are present in the DLLs via cuobjdump --list-elf.
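The same check can be scripted if you want to inspect several binaries. This is a rough sketch: the function names are hypothetical, it assumes `cuobjdump` is on your PATH, and it assumes the listing mentions each embedded architecture as an `sm_NN` token, so it simply collects those tokens rather than parsing the full output.

```python
import re
import subprocess


def archs_in_listing(listing):
    """Collect the distinct sm_NN tags mentioned in a cuobjdump listing."""
    tags = set(re.findall(r"sm_\d+[a-z]?", listing))
    # Sort numerically so sm_90 comes before sm_100.
    return sorted(tags, key=lambda t: (int(re.search(r"\d+", t).group()), t))


def list_elf_archs(binary_path):
    """Run `cuobjdump --list-elf` on a binary and return the sm_NN tags found."""
    result = subprocess.run(
        ["cuobjdump", "--list-elf", binary_path],
        capture_output=True, text=True, check=True,
    )
    return archs_in_listing(result.stdout)
```

Call `list_elf_archs` with the path to whichever DLL you want to inspect. A result without sm_120 would suggest that binary has no precompiled kernels for the RTX 50xx cards.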