I trained Qwen3 with torch and found that it runs normally on an RTX 2070 or on the CPU, but it fails on an RTX 5080, with both the stable release (2.7.0) and the latest nightly.
RTX 5080: CUDA 12.8, torch==2.8.0.dev20250613+cu128, transformers==4.52.4
RTX 2070: CUDA 12.9, torch==2.7.0+cu126, transformers==4.51.3
Code: I used the locally deployed official Qwen3-0.6B model weights for training.
If you want to reproduce this bug, please do not modify train_seq.
import warnings
import transformers
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="Failed to load image Python extension.*")
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, get_scheduler
from transformers.models.qwen2.tokenization_qwen2_fast import Qwen2TokenizerFast
from transformers.models.qwen3.modeling_qwen3 import Qwen3ForCausalLM
from torch.optim import AdamW
import os

if __name__ == '__main__':
    print(torch.__version__)
    print(transformers.__version__)
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

    model_name = "../dl_models/Qwen3-0.6B"
    device = "cuda"
    tokenizer: Qwen2TokenizerFast = AutoTokenizer.from_pretrained(model_name)
    model: Qwen3ForCausalLM = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    model.model.embed_tokens.weight.requires_grad = False  # freeze the embedding layer

    optimizer = AdamW(model.parameters(), lr=1e-5, betas=(0.9, 0.95), eps=1e-9)
    optimizer.zero_grad()
    model.train()

    # Please do not modify train_seq when reproducing the bug.
    train_seq = ["<think>\n\n</think>\n\n翻译:行我觉得如果我想要在前赶回宿舍的话,我就得尽快把事情做完。<|im_end|>"]
    inputs = tokenizer(train_seq, return_tensors="pt", padding=True).to(device)

    with torch.autograd.detect_anomaly(True):
        outputs = model(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=inputs.input_ids
        )
        loss = outputs.loss
        loss.backward()
As you can see, the code above is very basic and uses no quantization or acceleration tricks. Yet on the 5080 I get:
RuntimeError: Function 'MmBackward0' returned nan values in its 0th output.
Here is the full traceback:
D:\PycharmProjects\Qwen3\bug_report.py:35: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with torch.autograd.detect_anomaly(True):
D:\anaconda3\Lib\site-packages\torch\autograd\graph.py:829: UserWarning: Error detected in MmBackward0. Traceback of forward call that caused the error:
File "D:\PycharmProjects\Qwen3\bug_report.py", line 36, in <module>
outputs = model(
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\transformers\utils\generic.py", line 969, in wrapper
output = func(self, *args, **kwargs)
File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 734, in forward
outputs: BaseModelOutputWithPast = self.model(
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\transformers\utils\generic.py", line 969, in wrapper
output = func(self, *args, **kwargs)
File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 467, in forward
layer_outputs = decoder_layer(
File "D:\anaconda3\Lib\site-packages\transformers\modeling_layers.py", line 48, in __call__
return super().__call__(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 287, in forward
hidden_states, self_attn_weights = self.self_attn(
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\transformers\models\qwen3\modeling_qwen3.py", line 217, in forward
key_states = self.k_norm(self.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1771, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1782, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\Lib\site-packages\torch\nn\modules\linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
(Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\autograd\python_anomaly_mode.cpp:127.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "D:\PycharmProjects\Qwen3\bug_report.py", line 42, in <module>
loss.backward()
File "D:\anaconda3\Lib\site-packages\torch\_tensor.py", line 648, in backward
torch.autograd.backward(
File "D:\anaconda3\Lib\site-packages\torch\autograd\__init__.py", line 354, in backward
_engine_run_backward(
File "D:\anaconda3\Lib\site-packages\torch\autograd\graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Function 'MmBackward0' returned nan values in its 0th output.
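To help localize the failure, here is a minimal sketch (not part of the run above; it assumes the same local Qwen3-0.6B checkpoint and the same train_seq) that registers forward hooks and reports the first module whose output already contains NaN/Inf. This separates a forward-pass overflow from a NaN that only appears during backward:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "../dl_models/Qwen3-0.6B"   # same local checkpoint as above
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.train()

bad = []  # names of modules whose forward output is already non-finite

def make_hook(name):
    def hook(module, args, output):
        out = output[0] if isinstance(output, tuple) else output
        if torch.is_tensor(out) and not torch.isfinite(out).all():
            bad.append(name)
    return hook

# Hook every submodule; hooks fire in execution order, so bad[0] is the
# earliest offender if the forward pass produces non-finite values.
for name, module in model.named_modules():
    module.register_forward_hook(make_hook(name))

train_seq = ["<think>\n\n</think>\n\n翻译:行我觉得如果我想要在前赶回宿舍的话,我就得尽快把事情做完。<|im_end|>"]
inputs = tokenizer(train_seq, return_tensors="pt", padding=True).to(device)
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=inputs.input_ids)

if bad:
    print("first module with non-finite output:", bad[0])
else:
    print("forward pass is finite, loss =", outputs.loss.item())

If the forward pass is clean, the NaN is produced only in the backward matmul, which matches the anomaly report pointing at the backward of F.linear in the attention k_proj.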
I also measured the numerical error of CUDA computations on both the 5080 and the 2070 many times, and the results are very close.
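For reference, a rough sketch of this kind of numerical check (not the exact script used) compares GPU matmuls in several dtypes against a float64 CPU reference:

import torch

torch.manual_seed(0)
a = torch.randn(1024, 1024, dtype=torch.float64)
b = torch.randn(1024, 1024, dtype=torch.float64)
ref = a @ b  # float64 CPU reference

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    # run the same matmul on the GPU in a lower precision, then compare
    out = (a.to("cuda", dtype) @ b.to("cuda", dtype)).double().cpu()
    max_err = (out - ref).abs().max().item()
    print(f"{dtype}: max abs error vs fp64 = {max_err:.3e}")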
Versions:
PyTorch version: 2.8.0.dev20250613+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Pro (10.0.26100 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.26100-SP0
Is CUDA available: True
CUDA runtime version: 12.9.86
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5080
Nvidia driver version: 576.52
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Name: AMD Ryzen 7 9700X 8-Core Processor
Manufacturer: AuthenticAMD
Family: 107
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 3800
MaxClockSpeed: 3800
L2CacheSize: 8192
L2CacheSpeed: None
Revision: 17408
Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] mypy==1.11.2
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] numpydoc==1.7.0
[pip3] torch==2.8.0.dev20250613+cu128
[pip3] torchaudio==2.8.0.dev20250614+cu128
[pip3] torchvision==0.23.0.dev20250614+cu128
[pip3] triton-windows==3.3.1.post19
[conda] _anaconda_depends 2024.10 py312_mkl_0
[conda] blas 1.0 mkl
[conda] mkl 2023.1.0 h6b88ed4_46358
[conda] mkl-service 2.4.0 py312h2bbff1b_1
[conda] mkl_fft 1.3.10 py312h827c3e9_0
[conda] mkl_random 1.2.7 py312h0158946_0
[conda] numpy 1.26.4 py312hfd52020_0
[conda] numpy-base 1.26.4 py312h4dde369_0
[conda] numpydoc 1.7.0 py312haa95532_0
[conda] torch 2.8.0.dev20250613+cu128 pypi_0 pypi
[conda] torchaudio 2.8.0.dev20250614+cu128 pypi_0 pypi
[conda] torchvision 0.23.0.dev20250614+cu128 pypi_0 pypi
[conda] triton-windows 3.3.1.post19 pypi_0 pypi