Hi all,
I’m running into a reproducible CUDA kernel failure on an RTX 5090 (sm_120) when using models that rely on ConvNeXtV2 fused kernels. This appears to be due to missing sm_120 support in the current PyTorch builds.
Environment
- GPU: NVIDIA RTX 5090 (Blackwell, sm_120)
- Driver: NVIDIA Studio Driver 596.36
- CUDA Toolkit: 12.4
- CUDA Version: 13.2
- PyTorch: 2.6.0
- OS: Windows 11
Error
During inference, any model that uses ConvNeXtV2 or similar fused ops fails with:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This happens consistently in the ConvNeXtV2 forward pass, at the convolution call:
  File ".../convnextv2.py", line 120, in forward
    x = self.forward_features(x)
  ...
  File ".../conv.py", line 549, in _conv_forward
    return F.conv2d(...)
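Since the trace ends in a plain F.conv2d call, a standalone check that does not involve the model at all may help narrow things down. The following is only a minimal sketch: the function name `run_conv_repro` and the tensor shapes are hypothetical choices for illustration, and it assumes the failure is not specific to ConvNeXtV2 itself.

```python
def run_conv_repro():
    """Run a single conv2d on the GPU and return a short status string."""
    try:
        import torch
        import torch.nn.functional as F
    except ImportError:
        return "skipped: torch is not installed"

    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"

    try:
        device = torch.device("cuda")
        # Small arbitrary shapes: batch of 1, 3 input channels, 8 output channels.
        x = torch.randn(1, 3, 32, 32, device=device)
        w = torch.randn(8, 3, 3, 3, device=device)
        y = F.conv2d(x, w, padding=1)
        # Force any asynchronously reported kernel error to surface here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return f"failed: {err}"

    return f"ok: output shape {tuple(y.shape)}"


if __name__ == "__main__":
    print(run_conv_repro())
```

If this already reports "failed" with the same "no kernel image" message, the problem is probably in the installed build rather than in the model code.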
Summary of the issue
- PyTorch 2.6.0 wheels support up to sm_90 (Hopper).
- RTX 50xx GPUs require sm_120 kernels.
- Fused kernels (ConvNeXtV2, some custom ops) cannot fall back to PTX JIT unless the build embeds PTX for them.
- As a result, models that rely on these ops fail immediately on Blackwell GPUs.
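The mismatch described in the points above can be checked by comparing the device's compute capability against the architectures a build was compiled for. The sketch below is hedged: `parse_arch` and `can_run` are hypothetical helper names, the example architecture list is illustrative rather than read from my machine, and the rule it encodes (native code runs on the same major version with an equal or higher minor version, PTX can be JIT-compiled for any equal or newer device) is a simplification that ignores suffixed targets such as sm_90a.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_90' or 'compute_90' into (kind, major, minor)."""
    match = re.fullmatch(r"(sm|compute)_(\d{2,})", tag)
    if match is None:
        return None  # suffixed tags like 'sm_90a' are ignored in this sketch
    kind, digits = match.groups()
    # The last digit is the minor version, everything before it the major one.
    return kind, int(digits[:-1]), int(digits[-1])


def can_run(capability, arch_list):
    """Return True if a device with this (major, minor) capability can use the build."""
    dev_major, dev_minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        kind, major, minor = parsed
        # Native (SASS) code: same major version, minor not newer than the device.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # Embedded PTX: can be JIT-compiled for any equal or newer device.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


# Illustrative arch list (an assumption, not output from a real install).
example_archs = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
print(can_run((12, 0), example_archs))  # False: no sm_12x code and no PTX
```

On a machine with PyTorch installed, the real inputs would come from `torch.cuda.get_device_capability(0)` and `torch.cuda.get_arch_list()`.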
Request
Could the PyTorch team provide guidance or a timeline for:
- Official sm_120 support in upcoming PyTorch wheels
- Rebuilt fused kernels (ConvNeXtV2 and similar ops) targeting sm_120
- Any nightly builds or experimental wheels that include Blackwell support
- Whether CUDA 12.8 or later will be required for full compatibility
There are already several users reporting similar issues with RTX 50xx cards, so I wanted to provide a clean repro case and error trace.
Happy to provide additional logs, environment details, or test builds if needed.
Thanks!