Hi everyone,
I spent a few weeks debugging training crashes on my RTX 5090 and finally figured it out — sharing in case it helps anyone with the same GPU.
Setup
GPU: RTX 5090 Laptop (GB203, sm_120, 24GB VRAM)
OS: Windows 11
Framework: Style-Bert-VITS2 (VITS-based TTS model, DDP enabled by default)
Symptom
Inference worked perfectly, but training would segfault (0xC0000005 / ACCESS_VIOLATION) within seconds: no Python traceback, just a silent OS-level kill. I narrowed it down with checkpoint logging (print(flush=True) plus file logging, since a segfault discards any unflushed stdout buffers):
```
CP7j: about to loss_slm.backward() ✅
CP7k: backward DONE ❌ (never reached)
```
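For anyone who wants to reproduce this style of debugging, here is a minimal sketch of the checkpoint logger I mean (names and the log path are illustrative, not the exact code from my script):

```python
import os

def checkpoint(tag, path="train_debug.log"):
    """Log a progress marker that survives a hard crash.

    print(flush=True) alone can still be lost on some consoles, so we also
    append to a file and fsync so the OS writes it out immediately. If the
    process is killed between two checkpoints, the last line in the file
    tells you exactly which call never returned.
    """
    line = f"CP{tag}"
    print(line, flush=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())

# Usage around the suspect call:
# checkpoint("7j: about to loss_slm.backward()")
# loss_slm.backward()
# checkpoint("7k: backward DONE")
```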
The crash happened during backward() — specifically in gradient synchronization. Forward pass for all models (Generator, Discriminator, WavLM) completed fine every time.
I tried everything I could think of: disabling GradScaler, pure fp32 mode, CUDA_LAUNCH_BLOCKING=1, disabling WavLM loss — nothing helped. Even with WavLM disabled, the regular Discriminator backward also segfaulted at the same stage.
What I Tried — Two Paths
Path 1: PyTorch stable (2.4.1+cu121)
sm_120 native kernels don’t exist in stable, so PyTorch falls back to PTX JIT compilation. This actually works for basic operations — you get a UserWarning but execution continues:
```
UserWarning: NVIDIA GeForce RTX 5090 Laptop GPU with CUDA capability sm_120
is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 ... sm_90.
```
However, DDP’s broadcast_coalesced immediately fails:
```
RuntimeError: CUDA error: no kernel image is available for execution on the device
```
After patching DDP out (conditional skip when world_size == 1), I hit a cascade of `no kernel image` errors on progressively more basic operations:
1. torch.nn.functional.embedding (emb_g) → "no kernel image" ❌
2. torch.nn.functional.conv1d (bert_proj) → "no kernel image" ❌
Each one could be individually patched with CPU fallback, but the cascade kept going — PTX JIT fallback doesn’t cover enough ops for stable training. I gave up on this path.
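For reference, the per-op patching looked roughly like this. This is a generic sketch, not my exact code: it duck-types torch-like tensors via `.cpu()` / `.to()`, and the round trip to CPU per call is slow, which is part of why this path doesn't scale past a couple of small ops:

```python
import functools

def with_cpu_fallback(op):
    """Retry an op on CPU when the GPU build has no kernel for it.

    Assumes torch-like tensor arguments exposing .device, .cpu(), and .to().
    On a "no kernel image" RuntimeError, moves tensor args to CPU, reruns
    the op there, and moves the result back to the original device.
    """
    @functools.wraps(op)
    def wrapper(*args, **kwargs):
        try:
            return op(*args, **kwargs)
        except RuntimeError as e:
            if "no kernel image" not in str(e):
                raise
            device = next((a.device for a in args if hasattr(a, "device")), None)
            cpu_args = [a.cpu() if hasattr(a, "cpu") else a for a in args]
            out = op(*cpu_args, **kwargs)
            if device is not None and hasattr(out, "to"):
                return out.to(device)
            return out
    return wrapper

# e.g. F.embedding = with_cpu_fallback(F.embedding)
```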
Path 2: PyTorch nightly (2.12.0.dev+cu128)
Nightly has sm_120 native kernels, so forward pass works without any warnings or fallbacks. But training segfaults during backward — and here’s where it gets interesting.
First attempt — just setting CUDA_VISIBLE_DEVICES=0 (no code changes):
```
set CUDA_VISIBLE_DEVICES=0
python train_ms_jp_extra.py ...
```
→ Same segfault. The training script still calls DDP(model) unconditionally.
Second attempt — patching the actual DDP wrapping out of the code:
→ Training runs stable ✅ (1000+ steps, no crash)
The environment variable alone wasn’t enough — the training script still wraps models with DDP() regardless of how many GPUs are visible. The fix required changing the source code to skip DDP() wrapping when world_size == 1.
Interesting finding: on stable, the absence of native kernels triggers the PTX JIT fallback, which at least emits a warning and tries to continue. On nightly, native kernels exist but the DDP communication path is broken for sm_120, and because the native kernels are there, PyTorch uses them instead of falling back to PTX. The result is a harder failure: a segfault with no Python traceback at all.
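You can check which side of this divide your install is on. The comparison logic is a one-liner; the commented torch calls at the bottom show how I'd feed it real values (`torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are real PyTorch APIs):

```python
def has_native_kernels(arch_list, capability):
    """Return True if the installed binary ships kernels for this GPU.

    arch_list is what torch.cuda.get_arch_list() returns, e.g.
    ['sm_80', 'sm_90', 'compute_90']; capability is the device's
    (major, minor) from torch.cuda.get_device_capability().
    A 'compute_XX' entry means PTX is embedded, which is what lets
    older stable builds JIT-compile for newer GPUs.
    """
    sm = f"sm_{capability[0]}{capability[1]}"
    return sm in arch_list

# With torch installed:
# import torch
# print(has_native_kernels(torch.cuda.get_arch_list(),
#                          torch.cuda.get_device_capability()))
```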
The Fix (2 patches to the training script)
Patch 1: Skip DDP wrapping on single GPU
```python
# Before (applied to ALL DDP-wrapped models):
net_g = DDP(net_g, device_ids=[rank])
net_d = DDP(net_d, device_ids=[rank])
net_dur_disc = DDP(net_dur_disc, device_ids=[rank])
net_wd = DDP(net_wd, device_ids=[rank])

# After:
if torch.distributed.get_world_size() > 1:
    net_g = DDP(net_g, device_ids=[rank])
    net_d = DDP(net_d, device_ids=[rank])
    net_dur_disc = DDP(net_dur_disc, device_ids=[rank])
    net_wd = DDP(net_wd, device_ids=[rank])
```
Patch 2: Remove .module access
Without DDP, the model is no longer inside a DistributedDataParallel container, so .module doesn't exist. Every .module access needs to become a direct attribute access:
```python
# Before:
net_g.module.use_noise_scaled_mas
net_g.module.mas_noise_scale_initial
net_g.module.noise_scale_delta
net_g.module.current_mas_noise_scale
generator.module.infer(...)

# After:
net_g.use_noise_scaled_mas
net_g.mas_noise_scale_initial
net_g.noise_scale_delta
net_g.current_mas_noise_scale
generator.infer(...)
```
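If you'd rather keep the script working in both single- and multi-GPU mode, a small unwrap helper avoids editing every access site. This is a common community pattern, not code from the original script (caveat: it assumes none of your models name a submodule `module`):

```python
def unwrap(model):
    """Return the underlying module whether or not DDP wrapped it.

    DDP (and DataParallel) expose the wrapped model as .module;
    an unwrapped model has no such attribute, so we return it as-is.
    """
    return model.module if hasattr(model, "module") else model

# Works in both modes:
# unwrap(net_g).current_mas_noise_scale
# unwrap(generator).infer(...)
```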
Important: you also need to set CUDA_VISIBLE_DEVICES=0 when launching training, so that only one GPU is visible and the process group initializes with world_size == 1.
That’s it. Two patches + one environment variable.
Results
Batch size 8: 1.10–1.47 it/s
VRAM: 18.5GB / 24GB
Stable over thousands of steps — verified across multiple epochs
WavLM discriminator loss: works normally (was NOT the cause)
I initially suspected WavLM backward was the problem (it was the first backward to crash), but after removing DDP, WavLM backward runs fine. The segfault was always DDP’s gradient synchronization, not any specific model component.
Why This Only Affects Certain Frameworks
Other frameworks I tested on the same GPU worked fine from the start:
| Framework | DDP | Result |
|---|---|---|
| GPT-SoVITS v2/v4/v2Pro+ | Not used | ✅ Works |
| Applio RVC | Not used | ✅ Works |
| kohya_ss (SD training) | Accelerate auto-skips on single GPU | ✅ Works |
| Style-Bert-VITS2 | Enabled by default | ❌ Segfault in backward() |
The pattern is clear: frameworks that don’t use DDP (or auto-skip it on single GPU) work fine on sm_120. DDP + sm_120 is the specific combination that crashes.
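The auto-skip behavior is easy to replicate in any training script. Launchers like torchrun export WORLD_SIZE into the environment (single-process launches leave it unset), so the decision can be made before DDP is ever touched. A sketch, with `maybe_wrap_ddp` being my own illustrative name:

```python
import os

def effective_world_size():
    """World size as exported by torchrun (WORLD_SIZE); defaults to 1."""
    return int(os.environ.get("WORLD_SIZE", "1"))

def maybe_wrap_ddp(model, wrap):
    """Apply the DDP wrapper only when more than one process is running.

    `wrap` is a callable like: lambda m: DDP(m, device_ids=[rank]).
    On a single-process launch the model is returned untouched, which is
    essentially what Accelerate does for kohya_ss above.
    """
    return wrap(model) if effective_world_size() > 1 else model
```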
Environment
PyTorch: 2.12.0.dev (nightly) + cu128
Launch: set CUDA_VISIBLE_DEVICES=0
Patches: DDP conditional skip + .module removal
Hope this saves someone some debugging time. If anyone has more info on DDP behavior on Blackwell GPUs, I’d love to hear about it.