[BUG] RTX 5080: Function 'MmBackward0' returned nan values in its 0th output.

I trained Qwen3 with PyTorch and found that it runs normally on an RTX 2070 or on CPU, but fails on an RTX 5080, with both the stable release (2.7.0) and the latest nightly.

|              | RTX 5080                  | RTX 2070     |
|--------------|---------------------------|--------------|
| CUDA         | 12.8                      | 12.9         |
| torch        | 2.8.0.dev20250613+cu128   | 2.7.0+cu126  |
| transformers | 4.52.4                    | 4.51.3       |
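
For reference, the RTX 5080 is a Blackwell card (compute capability sm_120), so it is worth confirming that the installed wheel actually ships kernels for it. A minimal check using only standard `torch.cuda` APIs:

```python
import torch

print(torch.version.cuda)                   # CUDA version the wheel was built against
print(torch.cuda.get_device_capability(0))  # expected (12, 0) on an RTX 5080
print(torch.cuda.get_arch_list())           # the list should contain an sm_120 entry
```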

Code: I used the official Qwen3-0.6B model weights, deployed locally, for training.

If you want to reproduce this bug, please do not modify `train_seq`.

```python
import os
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="Failed to load image Python extension.*")

import torch
import transformers
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.qwen2.tokenization_qwen2_fast import Qwen2TokenizerFast
from transformers.models.qwen3.modeling_qwen3 import Qwen3ForCausalLM


if __name__ == '__main__':
    print(torch.__version__)
    print(transformers.__version__)

    # Force synchronous kernel launches so the failing op is reported accurately
    # (set before the first CUDA call).
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
    model_name = "../dl_models/Qwen3-0.6B"
    device = "cuda"

    tokenizer: Qwen2TokenizerFast = AutoTokenizer.from_pretrained(model_name)
    model: Qwen3ForCausalLM = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    # Freeze the embedding table; everything else stays trainable.
    model.model.embed_tokens.weight.requires_grad = False
    optimizer = AdamW(model.parameters(), lr=1e-5, betas=(0.9, 0.95), eps=1e-9)
    optimizer.zero_grad()
    model.train()

    # Do not modify this sequence if you want to reproduce the bug.
    train_seq = ["<think>\n\n</think>\n\n翻译:行我觉得如果我想要在前赶回宿舍的话,我就得尽快把事情做完。<|im_end|>"]
    inputs = tokenizer(train_seq, return_tensors="pt", padding=True).to(device)
    # Anomaly detection pinpoints the backward node that produces the NaN.
    with torch.autograd.detect_anomaly(True):
        outputs = model(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=inputs.input_ids
        )
        loss = outputs.loss
        loss.backward()
```

As the code above shows, the repro is very basic and does not involve any quantization or acceleration tricks.
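
To rule out reduced-precision matmul paths as a variable, the float32 matmul settings can also be pinned explicitly. This is only an isolation step worth trying, not something the repro requires:

```python
import torch

# Inspect the current TF32 settings, then force full-precision float32 matmuls.
print(torch.backends.cuda.matmul.allow_tf32)   # TF32 for CUDA matmuls
print(torch.backends.cudnn.allow_tf32)         # TF32 inside cuDNN kernels
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
torch.set_float32_matmul_precision("highest")  # disallow TF32 downcasts in matmuls
```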

But on the 5080 I get `RuntimeError: Function 'MmBackward0' returned nan values in its 0th output.`
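
Since the anomaly trace points at the `k_proj` linear layer in the forward pass, it helps to check whether the forward activations are already non-finite or whether the NaN appears only in backward. A hook-based sketch for this (my own debugging aid, to be attached before calling `model(...)`):

```python
import torch

def register_nan_hooks(model):
    """Print every module whose forward output contains NaN/Inf.
    The first line printed is the earliest offender in forward order."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    print(f"non-finite output in {name} ({type(module).__name__})")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call .remove() on each handle when done
```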

Here is the full traceback:

```
D:\PycharmProjects\Qwen3\bug_report.py:35: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly(True):
D:\anaconda3\Lib\site-packages\torch\autograd\graph.py:829: UserWarning: Error detected in MmBackward0. Traceback of forward call that caused the error:
  File "D:\PycharmProjects\Qwen3\bug_report.py", line 36, in <module>
    outputs = model(
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\transformers\utils\generic.py", line 969, in wrapper
    output = func(self, *args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 734, in forward
    outputs: BaseModelOutputWithPast = self.model(
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\transformers\utils\generic.py", line 969, in wrapper
    output = func(self, *args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 467, in forward
    layer_outputs = decoder_layer(
  File "D:\anaconda3\Lib\site-packages\transformers\modeling_layers.py", line 48, in __call__
    return super().__call__(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 287, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 217, in forward
    key_states = self.k_norm(self.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\Lib\site-packages\torch\nn\modules\linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
 (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\autograd\python_anomaly_mode.cpp:127.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "D:\PycharmProjects\Qwen3\bug_report.py", line 42, in <module>
    loss.backward()
  File "D:\anaconda3\Lib\site-packages\torch\_tensor.py", line 648, in backward
    torch.autograd.backward(
  File "D:\anaconda3\Lib\site-packages\torch\autograd\__init__.py", line 354, in backward
    _engine_run_backward(
  File "D:\anaconda3\Lib\site-packages\torch\autograd\graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Function 'MmBackward0' returned nan values in its 0th output.
```

I measured the numerical error of CUDA computations on both my 5080 and my 2070 many times, and the results are very close.
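
For reference, the kind of numerical check I ran is sketched below, comparing a float32 CUDA matmul against a float64 CPU reference; the matrix sizes are an arbitrary choice:

```python
import torch

torch.manual_seed(0)
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

ref = a.double() @ b.double()               # float64 CPU reference
gpu = (a.cuda() @ b.cuda()).cpu().double()  # float32 CUDA result

err = (gpu - ref).abs().max().item()
print(f"max abs error vs float64 reference: {err:.3e}")  # ~1e-4 is typical for fp32
```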

Versions:

```
PyTorch version: 2.8.0.dev20250613+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro (10.0.26100, 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.26100-SP0
Is CUDA available: True
CUDA runtime version: 12.9.86
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5080
Nvidia driver version: 576.52
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: AMD Ryzen 7 9700X 8-Core Processor             
Manufacturer: AuthenticAMD
Family: 107
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 3800
MaxClockSpeed: 3800
L2CacheSize: 8192
L2CacheSpeed: None
Revision: 17408

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] mypy==1.11.2
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] numpydoc==1.7.0
[pip3] torch==2.8.0.dev20250613+cu128
[pip3] torchaudio==2.8.0.dev20250614+cu128
[pip3] torchvision==0.23.0.dev20250614+cu128
[pip3] triton-windows==3.3.1.post19
[conda] _anaconda_depends         2024.10             py312_mkl_0  
[conda] blas                      1.0                         mkl  
[conda] mkl                       2023.1.0         h6b88ed4_46358  
[conda] mkl-service               2.4.0           py312h2bbff1b_1  
[conda] mkl_fft                   1.3.10          py312h827c3e9_0  
[conda] mkl_random                1.2.7           py312h0158946_0  
[conda] numpy                     1.26.4          py312hfd52020_0  
[conda] numpy-base                1.26.4          py312h4dde369_0  
[conda] numpydoc                  1.7.0           py312haa95532_0  
[conda] torch                     2.8.0.dev20250613+cu128          pypi_0    pypi
[conda] torchaudio                2.8.0.dev20250614+cu128          pypi_0    pypi
[conda] torchvision               0.23.0.dev20250614+cu128          pypi_0    pypi
[conda] triton-windows            3.3.1.post19             pypi_0    pypi
```

> Is this issue reproducible with a randomly initialized model or do I need to use your checkpoint?

Hi @ptrblck, just use the official checkpoint from Hugging Face: Qwen/Qwen3-0.6B (main branch).